(54) Efficient algorithm for finding candidate objects for remote differential compression

(57) The present invention finds candidate objects 
(O_A, O_B) for remote differential compression. Objects
(O_A, O_B) are updated between two or more computing
devices (100, 101) using remote differential compression
(RDC) techniques such that required data transfers are 
minimized. An algorithm provides enhanced efficiencies 
for allowing the receiver to locate a set of objects that are 
similar to the object that needs to be transferred from the 
sender. Once this set of similar objects has been found, 
the receiver may reuse any chunks (chunk 1 ... chunk k;
chunk 1 ... chunk n) from these objects during the RDC
algorithm. 
Description

Background of the Invention 

[0001] The proliferation of networks such as intranets, extranets, and the internet has led to a large growth in the
number of users that share information across wide networks. A maximum data transfer rate is associated with each
physical network based on the bandwidth associated with the transmission medium as well as other infrastructure-related
limitations. As a result of limited network bandwidth, users can experience long delays in retrieving and transferring large
amounts of data across the network.

[0002] Data compression techniques have become a popular way to transfer large amounts of data across a network
with limited bandwidth. Data compression can be generally characterized as either lossless or lossy. Lossless compression
involves the transformation of a data set such that an exact reproduction of the data set can be retrieved by applying
a decompression transformation. Lossless compression is most often used to compact data when an exact replica is
required.

[0003] In the case where the recipient of a data object already has a previous, or older, version of that object, a lossless
compression approach called Remote Differential Compression (RDC) may be used to determine and only transfer the
differences between the new and the old versions of the object. Since an RDC transfer only involves communicating
the observed differences between the new and old versions (for instance, in the case of files, file modification or last
access dates, file attributes, or small changes to the file contents), the total amount of data transferred can be greatly
reduced. RDC can be combined with another lossless compression algorithm to further reduce the network traffic. The
benefits of RDC are most significant in the case where large objects need to be communicated frequently back and forth
between computing devices and it is difficult or infeasible to maintain old copies of these objects, so that local differential
algorithms cannot be used.

Summary of the Invention

[0004] Briefly stated, the present invention is related to a method and system for finding candidate objects for remote
differential compression. Objects are updated between two or more computing devices using remote differential
compression (RDC) techniques such that required data transfers are minimized. In one aspect, an algorithm provides
enhanced efficiencies by allowing the sender to communicate a small amount of meta-data to the receiver, and the receiver
to use this meta-data to locate a set of objects that are similar to the object that needs to be transferred from the sender.
Once this set of similar objects has been found, the receiver may reuse any parts of these objects as needed during the
RDC algorithm.

[0005] A more complete appreciation of the present invention and its improvements can be obtained by reference to
the accompanying drawings, which are briefly summarized below, to the following detailed description of illustrative
embodiments of the invention, and to the appended claims.

Brief Description of the Drawings 

[0006] Non-limiting and non-exhaustive embodiments of the present invention are described with reference to the
following drawings.

FIG. 1 is a diagram illustrating an operating environment;

FIG. 2 is a diagram illustrating an example computing device;

FIGS. 3A and 3B are diagrams illustrating an example RDC procedure;

FIGS. 4A and 4B are diagrams illustrating process flows for the interaction between a local device and a remote
device during an example RDC procedure;

FIGS. 5A and 5B are diagrams illustrating process flows for recursive remote differential compression of the signature
and chunk length lists in an example interaction during an RDC procedure;

FIG. 6 is a diagram that graphically illustrates an example of recursive compression in an example RDC sequence;

FIG. 7 is a diagram illustrating the interaction of a client and server application using an example RDC procedure;

FIG. 8 is a diagram illustrating a process flow for an example chunking procedure;

FIG. 9 is a diagram of example instruction code for an example chunking procedure;

FIGS. 10 and 11 are diagrams of example instruction code for another example chunking procedure;

FIG. 12 illustrates an RDC algorithm modified to find and use candidate objects;

FIGS. 13 and 14 show a process and an example of a trait computation;

FIGS. 15 and 16 may be used when selecting the parameters for b and t;

FIG. 17 illustrates data structures that make up a compact representation of an Object Map and a set of Trait Tables;
and

FIG. 18 illustrates a process for computing similar traits, in accordance with aspects of the present invention.

Detailed Description of the Preferred Embodiment 

[0007] Various embodiments of the present invention will be described in detail with reference to the drawings, where
like reference numerals represent like parts and assemblies throughout the several views. Reference to various
embodiments does not limit the scope of the invention, which is limited only by the scope of the claims attached hereto.
Additionally, any examples set forth in this specification are not intended to be limiting and merely set forth some of the
many possible embodiments for the claimed invention.

[0008] The present invention is described in the context of local and remote computing devices (or "devices", for short)
that have one or more commonly associated objects stored thereon. The terms "local" and "remote" refer to one instance
of the method. However, the same device may play both a "local" and a "remote" role in different instances. Remote
Differential Compression (RDC) methods are used to efficiently update the commonly associated objects over a network
with limited bandwidth. When a device having a new copy of an object needs to update a device having an older copy
of the same object, or of a similar object, the RDC method is employed to only transmit the differences between the
objects over the network. An example RDC method described here uses (1) a recursive approach for the transmission of the
RDC metadata, to reduce the amount of metadata transferred for large objects, and (2) a local maximum-based chunking
method to increase the precision associated with the object differencing such that bandwidth utilization is minimized.
Some example applications that benefit from the described RDC methods include peer-to-peer replication services,
file-transfer protocols such as SMB, virtual servers that transfer large images, email servers, cellular phone and PDA
synchronization, and database server replication, to name just a few.

Operating Environment 

[0009] FIG. 1 is a diagram illustrating an example operating environment for the present invention. As illustrated in
the figure, devices are arranged to communicate over a network. These devices may be general purpose computing
devices, special purpose computing devices, or any other appropriate devices that are connected to a network. The
network 102 may correspond to any connectivity topology including, but not limited to: a direct wired connection (e.g.,
parallel port, serial port, USB, IEEE 1394, etc.), a wireless connection (e.g., IR port, Bluetooth port, etc.), a wired network,
a wireless network, a local area network, a wide area network, an ultra-wide area network, an internet, an intranet, and
an extranet.

[0010] In an example interaction between device A (100) and device B (101), different versions of an object are locally
stored on the two devices: object O_A on 100 and object O_B on 101. At some point, device A (100) decides to update its
copy of object O_A with the copy (object O_B) stored on device B (101), and sends a request to device B (101) to initiate
the RDC method. In an alternate embodiment, the RDC method could be initiated by device B (101).
[0011] Device A (100) and device B (101) both process their locally stored object and divide the associated data into
a variable number of chunks in a data-dependent fashion (e.g., chunks 1 - n for object O_B, and chunks 1 - k for object
O_A, respectively). A set of signatures such as strong hashes (SHA) for the chunks is computed locally by both
devices. The devices both compile separate lists of the signatures. During the next step of the RDC method, device B
(101) transmits its computed list of signatures and chunk lengths 1 - n to device A (100) over the network 102. Device
A (100) evaluates this list of signatures by comparing each received signature to its own generated signature list 1 - k.
Mismatches in the signature lists indicate one or more differences in the objects that require correction. Device A (100)
transmits a request for device B (101) to send the chunks that have been identified by the mismatches in the signature
lists. Device B (101) subsequently compresses and transmits the requested chunks, which are then reassembled by
device A (100) after reception and decompression are accomplished. Device A (100) reassembles the received chunks
together with its own matching chunks to obtain a local copy of object O_B.

Example Computing Device 

[0012] FIG. 2 is a block diagram of an example computing device that is arranged in accordance with the present
invention. In a basic configuration, computing device 200 typically includes at least one processing unit (202) and system
memory (204). Depending on the exact configuration and type of computing device, system memory 204 may be volatile
(such as RAM), non-volatile (such as ROM, flash memory, etc.) or some combination of the two. System memory 204
typically includes an operating system (205); one or more program modules (206); and may include program data (207).
This basic configuration is illustrated in FIG. 2 by those components within dashed line 208.

[0013] Computing device 200 may also have additional features or functionality. For example, computing device 200
may also include additional data storage devices (removable and/or non-removable) such as, for example, magnetic
disks, optical disks, or tape. Such additional storage is illustrated in FIG. 2 by removable storage 209 and non-removable
storage 210. Computer storage media may include volatile and non-volatile, removable and non-removable media
implemented in any method or technology for storage of information, such as computer readable instructions, data
structures, program modules or other data. System memory 204, removable storage 209 and non-removable storage
210 are all examples of computer storage media. Computer storage media includes, but is not limited to, RAM, ROM,
EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical storage,
magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium
which can be used to store the desired information and which can be accessed by computing device 200. Any such
computer storage media may be part of device 200. Computing device 200 may also have input device(s) 212 such as
keyboard, mouse, pen, voice input device, touch input device, etc. Output device(s) 214 such as a display, speakers,
printer, etc. may also be included. All these devices are known in the art and need not be discussed at length here.

[0014] Computing device 200 also contains communications connection(s) 216 that allow the device to communicate
with other computing devices 218, such as over a network. Communications connection(s) 216 is an example of
communication media. Communication media typically embodies computer readable instructions, data structures, program
modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes
any information delivery media. The term "modulated data signal" means a signal that has one or more of its characteristics
set or changed in such a manner as to encode information in the signal. By way of example, and not limitation,
communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as
acoustic, RF, microwave, satellite, infrared and other wireless media. The term computer readable media as used herein
includes both storage media and communication media.

[0015] Various procedures and interfaces may be implemented in one or more application programs that reside in
system memory 204. In one example, the application program is a remote differential compression algorithm that
schedules file synchronization between the computing device (e.g., a client) and another remotely located computing device
(e.g., a server). In another example, the application program is a compression/decompression procedure that is provided
in system memory 204 for compressing and decompressing data. In still another example, the application program is a
decryption procedure that is provided in system memory 204 of a client device.

Remote Differential Compression (RDC) 

[0016] FIGS. 3A and 3B are diagrams illustrating an example RDC procedure according to at least one aspect of the
present invention. The number of chunks in particular can vary for each instance depending on the actual objects O_A
and O_B.

[0017] Referring to FIG. 3A, the basic RDC protocol is negotiated between two computing devices (device A and
device B). The RDC protocol assumes implicitly that the devices A and B have two different instances (or versions) of
the same object or resource, which are identified by object instances (or versions) O_A and O_B, respectively. For the
example illustrated in this figure, device A has an old version of the resource O_A, while device B has a version O_B with
a slight (or incremental) difference in the content (or data) associated with the resource.

[0018] The protocol for transferring the updated object O_B from device B to device A is described below. A similar
protocol may be used to transfer an object from device A to device B, and the transfer can be initiated at the behest
of either device A or device B without significantly changing the protocol described below.

1. Device A sends device B a request to transfer Object O_B using the RDC protocol. In an alternate embodiment,
device B initiates the transfer; in this case, the protocol skips step 1 and starts at step 2 below.

2. Device A partitions Object O_A into chunks 1 - k, and computes a signature Sig_Ai and a length (or size in bytes)
Len_Ai for each chunk 1 ... k of Object O_A. The partitioning into chunks will be described in detail below. Device A
stores the list of signatures and chunk lengths ((Sig_A1, Len_A1) ... (Sig_Ak, Len_Ak)).

3. Device B partitions Object O_B into chunks 1 - n, and computes a signature Sig_Bi and a length Len_Bi for each chunk
1 ... n of Object O_B. The partitioning algorithm used in step 3 must match the one in step 2 above.

4. Device B sends a list of its computed chunk signatures and chunk lengths ((Sig_B1, Len_B1) ... (Sig_Bn, Len_Bn)) that
are associated with Object O_B to device A. The chunk length information may be subsequently used by device A
to request a particular set of chunks by identifying them with their start offset and their length. Because of the
sequential nature of the list, it is possible to compute the starting offset in bytes of each chunk Bi by adding up the
lengths of all preceding chunks in the list. In another embodiment, the list of chunk signatures and chunk lengths is
compactly encoded and further compressed using a lossless compression algorithm before being sent to device A.

5. Upon receipt of this data, device A compares the received signature list against the signatures Sig_A1 ... Sig_Ak that
it computed for Object O_A in step 2, which is associated with the old version of the content.

6. Device A sends a request to device B for all the chunks whose signatures received in step 4 from device B failed
to match any of the signatures computed by device A in step 2. For each requested chunk Bi, the request comprises
the chunk start offset computed by device A in step 4 and the chunk length.

7. Device B sends the content associated with all the requested chunks to device A. The content sent by device B
may be further compressed using a lossless compression algorithm before being sent to device A.

8. Device A reconstructs a local copy of Object O_B by using the chunks received in step 7 from device B, as well
as its own chunks of Object O_A that matched signatures sent by device B in step 4. The order in which the local and
remote chunks are rearranged on device A is determined by the list of chunk signatures received by device A in step 4.
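
Steps 2 through 6 above can be sketched as follows. This is a minimal illustration rather than the claimed implementation: SHA-256 stands in for the unspecified strong hash, the chunk contents are made up, and the helper names `signature_list` and `chunks_to_request` are hypothetical.

```python
import hashlib
from itertools import accumulate

def signature_list(chunks):
    """Steps 2 and 3: one (signature, length) pair per chunk (SHA-256 is an assumption)."""
    return [(hashlib.sha256(c).digest(), len(c)) for c in chunks]

def chunks_to_request(local_list, remote_list):
    """Steps 5 and 6: (start offset, length) of each remote chunk with no local match."""
    known = {sig for sig, _ in local_list}
    lengths = [length for _, length in remote_list]
    # The start offset of a chunk is the sum of the lengths of all preceding chunks.
    offsets = [0] + list(accumulate(lengths))[:-1]
    return [(offset, length)
            for offset, (sig, length) in zip(offsets, remote_list)
            if sig not in known]

# Illustrative data: device A holds the old version, device B the new one.
chunks_a = [b"alpha", b"bravo", b"delta"]
chunks_b = [b"alpha", b"BRAVO!", b"charlie", b"delta"]

request = chunks_to_request(signature_list(chunks_a), signature_list(chunks_b))
print(request)  # [(5, 6), (11, 7)]
```

Only the two chunks that device A cannot match are requested; the unchanged chunks "alpha" and "delta" never cross the network.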

[0019] The partitioning steps 2 and 3 may occur in a data-dependent fashion that uses a fingerprinting function that
is computed at every byte position in the associated object (O_A and O_B, respectively). For a given position, the
fingerprinting function is computed using a small data window surrounding that position in the object; the value of the
fingerprinting function depends on all the bytes of the object included in that window. The fingerprinting function can be any
appropriate function, such as, for example, a hash function or a Rabin polynomial.

[0020] Chunk boundaries are determined at positions in the Object for which the fingerprinting function computes to
a value that satisfies a chosen condition. The chunk signatures may be computed using a cryptographically secure hash
function (SHA), or some other hash function such as a collision-resistant hash function.
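
A minimal sketch of data-dependent chunking is shown below. It is not the local maximum-based procedure described later in this specification: the boundary condition (fingerprint divisible by a constant), the 16-byte window that ends at the current position, and the use of BLAKE2b as the fingerprinting function are all assumptions chosen to keep the example short.

```python
import hashlib
import random

WINDOW = 16    # bytes covered by the fingerprint window (assumed value)
DIVISOR = 64   # a boundary is declared when fingerprint % DIVISOR == 0 (assumed condition)

def fingerprint(window: bytes) -> int:
    """Hypothetical fingerprinting function; its value depends on every byte in the window."""
    return int.from_bytes(hashlib.blake2b(window, digest_size=4).digest(), "big")

def chunk(data: bytes) -> list:
    """Cut `data` after every position whose window fingerprint satisfies the condition."""
    chunks, start = [], 0
    for pos in range(WINDOW - 1, len(data)):
        window = data[pos - WINDOW + 1: pos + 1]
        if fingerprint(window) % DIVISOR == 0:
            chunks.append(data[start:pos + 1])
            start = pos + 1
    if start < len(data):
        chunks.append(data[start:])   # trailing bytes form the last chunk
    return chunks

# Illustrative data: inserting a few bytes near the start only disturbs nearby chunks.
rng = random.Random(0)
old_data = bytes(rng.getrandbits(8) for _ in range(4000))
new_data = old_data[:100] + b"INSERTED" + old_data[100:]
old_chunks, new_chunks = chunk(old_data), chunk(new_data)
shared = set(old_chunks) & set(new_chunks)
print(len(old_chunks), len(new_chunks), len(shared))
```

Because each boundary depends only on the bytes inside its window, boundaries located after the insertion shift along with the content, so the chunks that follow are byte-for-byte identical in both versions and their signatures still match.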

[0021] The signature and chunk length list sent in step 4 provides a basis for reconstructing the object using both the 
original chunks and the identified updated or new chunks. The chunks that are requested in step 6 are identified by their 
offset and lengths. The object is reconstructed on device A by using local and remote chunks whose signatures match 
the ones received by device A in step 4, in the same order. 

[0022] After the reconstruction step is completed by device A, Object O_A can be deleted and replaced by the copy of
Object O_B that was reconstructed on device A. In other embodiments, device A may keep Object O_A around for potential
"reuse" of chunks during future RDC transfers.

[0023] For large objects, the basic RDC protocol instance illustrated in FIG. 3A incurs a significant fixed overhead in
Step 4, even if Object O_A and Object O_B are very close, or identical. Given an average chunk size C, the amount of
information transmitted over the network in Step 4 is proportional to the size of Object O_B; specifically, it is proportional
to the size of Object O_B divided by C, which is the number of chunks of Object O_B, and thus of (chunk signature, chunk
length) pairs transmitted in step 4.

[0024] For example, referring to FIG. 6, a large image (e.g., a virtual hard disk image used by a virtual machine monitor
such as, for example, Microsoft Virtual Server) may result in an Object (O_B) with a size of 9.1 GB. For an average chunk
size C equal to 3KB, the 9GB object may result in 3 million chunks being generated for Object O_B, with 42MB of associated
signature and chunk length information that needs to be sent over the network in Step 4. Since the 42MB of signature
information must be sent over the network even when the differences between Object O_A and Object O_B (and thus the
amount of data that needs to be sent in Step 7) are very small, the fixed overhead cost of the protocol is excessively high.
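
The figures in this example can be checked with a few lines of arithmetic. The sketch below assumes decimal units (1 KB = 1000 bytes); the per-pair size is derived here from the quoted totals and is not stated in the specification.

```python
object_size = 9.1e9    # size of Object O_B in bytes (9.1 GB, decimal units assumed)
avg_chunk = 3_000      # average chunk size C in bytes (3 KB)
overhead = 42e6        # signature and chunk length data sent in Step 4 (42 MB)

chunks = object_size / avg_chunk      # number of (signature, length) pairs
bytes_per_pair = overhead / chunks    # derived size of one pair

print(f"{chunks / 1e6:.2f} million chunks, {bytes_per_pair:.1f} bytes per pair")
# 3.03 million chunks, 13.8 bytes per pair
```

The roughly 14 bytes per pair are paid for every chunk of Object O_B, whether or not that chunk differs from Object O_A, which is why the cost in Step 4 is described as fixed.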

[0025] This fixed overhead cost can be significantly reduced by using a recursive application of the RDC protocol
instead of the signature information transfer in step 4. Referring to FIG. 3B, additional steps 4.2 - 4.8, described
below, replace step 4 of the basic RDC algorithm. Steps 4.2 - 4.8 correspond to a recursive application of
steps 2 - 8 of the basic RDC protocol described above. The recursive application can be further applied to step 4.4
below, and so on, up to any desired recursion depth.

4.2. Device A performs a recursive chunking of its signature and chunk length list ((Sig_A1, Len_A1) ... (Sig_Ak, Len_Ak))
into recursive signature chunks, obtaining another list of recursive signatures and recursive chunk lengths ((RSig_A1,
RLen_A1) ... (RSig_As, RLen_As)), where s ≪ k.

4.3. Device B recursively chunks up the list of signatures and chunk lengths ((Sig_B1, Len_B1) ... (Sig_Bn, Len_Bn)) to
produce a list of recursive signatures and recursive chunk lengths ((RSig_B1, RLen_B1) ... (RSig_Br, RLen_Br)), where
r ≪ n.

4.4. Device B sends an ordered list of recursive signatures and recursive chunk lengths ((RSig_B1, RLen_B1) ... (RSig_Br,
RLen_Br)) to device A. The list of recursive chunk signatures and recursive chunk lengths is compactly encoded and
may be further compressed using a lossless compression algorithm before being sent to device A.

4.5. Device A compares the recursive signatures received from device B with its own list of recursive signatures
computed in Step 4.2.

4.6. Device A sends a request to device B for every distinct recursive signature chunk (with recursive signature
RSig_Bk) for which device A does not have a matching recursive signature in its set (RSig_A1 ... RSig_As).

4.7. Device B sends device A the requested recursive signature chunks. The requested recursive signature chunks
may be further compressed using a lossless compression algorithm before being sent to device A.

4.8. Device A reconstructs the list of signatures and chunk information ((Sig_B1, Len_B1) ... (Sig_Bn, Len_Bn)) using the
locally matching recursive signature chunks, and the recursive chunks received from device B in Step 4.7.

[0026] After step 4.8 above is completed, execution continues at step 5 of the basic RDC protocol described above,
which is illustrated in FIG. 3A.
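
Steps 4.2 through 4.6 can be sketched as follows. This is an illustration under stated assumptions, not the specified procedure: the recursive boundary rule (close a recursive chunk after any pair whose first four signature bytes, read as an integer, are divisible by `avg`), the 4-byte length encoding, SHA-256, and all function names are hypothetical.

```python
import hashlib

def sig(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def recursive_chunks(pairs, avg=4):
    """Steps 4.2 and 4.3: group (signature, length) pairs in a data-dependent way."""
    groups, current = [], []
    for s, length in pairs:
        current.append((s, length))
        # Assumed boundary rule: depends only on the content of the current pair.
        if int.from_bytes(s[:4], "big") % avg == 0:
            groups.append(current)
            current = []
    if current:
        groups.append(current)
    return groups

def recursive_signature(group) -> bytes:
    """Signature over the serialized pairs of one recursive chunk."""
    return sig(b"".join(s + length.to_bytes(4, "big") for s, length in group))

def missing_recursive_chunks(pairs_a, pairs_b):
    """Steps 4.5 and 4.6: recursive chunks of B's list that A cannot rebuild locally."""
    known = {recursive_signature(g) for g in recursive_chunks(pairs_a)}
    return [g for g in recursive_chunks(pairs_b) if recursive_signature(g) not in known]

# Illustrative data: 200 chunks, of which exactly one differs between the two devices.
chunks_a = [f"chunk-{i}".encode() for i in range(200)]
chunks_b = list(chunks_a)
chunks_b[100] = b"chunk-100-modified"
pairs_a = [(sig(c), len(c)) for c in chunks_a]
pairs_b = [(sig(c), len(c)) for c in chunks_b]

missing = missing_recursive_chunks(pairs_a, pairs_b)
print(len(missing), "recursive chunk(s),", sum(len(g) for g in missing), "of", len(pairs_b), "pairs")
```

With this boundary rule a single changed chunk disturbs at most two recursive chunks, so device B sends only a handful of (signature, length) pairs instead of the full list.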

[0027] As a result of the recursive chunking operations, the number of recursive signatures associated with the objects
is reduced by a factor equal to the average chunk size C, yielding a significantly smaller number of recursive signatures
(r ≪ n for object O_B and s ≪ k for object O_A, respectively). In one embodiment, the same chunking parameters could
be used for chunking the signatures as for chunking the original objects O_A and O_B. In an alternate embodiment, other
chunking parameters may be used for the recursive steps.

[0028] For very large objects the above recursive steps can be applied k times, where k > 1. For an average chunk
size of C, recursive chunking may reduce the size of the signature traffic over the network (steps 4.2 through 4.8) by a
factor approximately corresponding to C^k. Since C is relatively large, a recursion depth of greater than one may only be
necessary for very large objects.

[0029] In one embodiment, the number of recursive steps may be dynamically determined by considering parameters
that include one or more of the following: the expected average chunk size, the size of the objects O_A and/or O_B, the
data format of the objects O_A and/or O_B, and the latency and bandwidth characteristics of the network connecting device A
and device B.

[0030] The fingerprinting function used in step 2 is matched to the fingerprinting function that is used in step 3. Similarly,
the fingerprinting function used in step 4.2 is matched to the fingerprinting function that is used in step 4.3. The
fingerprinting function from steps 2 - 3 can optionally be matched to the fingerprinting function from steps 4.2 - 4.3.

[0031] As described previously, each fingerprinting function uses a small data window that surrounds a position in the
object, where the value associated with the fingerprinting function depends on all the bytes of the object that are included
inside the data window. The size of the data window can be dynamically adjusted based on one or more criteria.
Furthermore, the chunking procedure uses the value of the fingerprinting function and one or more additional chunking
parameters to determine the chunk boundaries in steps 2 - 3 and 4.2 - 4.3 above.

[0032] By dynamically changing the window size and the chunking parameters, the chunk boundaries are adjusted 
such that any necessary data transfers are accomplished with minimal consumption of the available bandwidth. 

[0033] Example criteria for adjusting the window size and the chunking parameters include: a data type associated
with the object, environmental constraints, a usage model, the latency and bandwidth characteristics of the network
connecting device A and device B, and any other appropriate model for determining average data transfer block sizes.
Example data types include word processing files, database images, spreadsheets, presentation slide shows, and
graphic images. An example usage model may be where the average number of bytes required in a typical data transfer
is monitored.

[0034] Changes to a single element within an application program can result in a number of changes to the associated
datum and/or file. Since most application programs have an associated file type, the file type is one possible criterion that
is worthy of consideration in adjusting the window size and the chunking parameters. In one example, the modification
of a single character in a word processing document results in approximately 100 bytes being changed in the associated
file. In another example, the modification of a single element in a database application results in 1000 bytes being
changed in the database index file. For each example, the appropriate window size and chunking parameters may be
different such that the chunking procedure has an appropriate granularity that is optimized based on the particular
application.

Example Process Flow

[0035] FIGS. 4A and 4B are diagrams illustrating process flows for the interaction between a local device (e.g., device
A) and a remote device (e.g., device B) during an example RDC procedure that is arranged in accordance with at least
one aspect of the present invention. The left hand side of FIG. 4A illustrates steps 400 - 413 that are operated on the
local device A, while the right hand side of FIG. 4A illustrates steps 450 - 456 that are operated on the remote device B.

[0036] As illustrated in FIG. 4A, the interaction starts by device A requesting an RDC transfer of object O_B in step 400,
and device B receiving this request in step 450. Following this, both the local device A and remote device B independently
compute fingerprints in steps 401 and 451, divide their respective objects into chunks in steps 402 and 452, and compute
signatures (e.g., SHA) for each chunk in steps 403 and 453, respectively.

[0037] In step 454, device B sends the signature and chunk length list computed in steps 452 and 453 to device A,
which receives this information in step 404.

[0038] In step 405, the local device A initializes the list of requested chunks to the empty list, and initializes the tracking
offset for the remote chunks to 0. In step 406, the next (signature, chunk length) pair (Sig_Bi, Len_Bi) is selected for
consideration from the list received in step 404. In step 407, device A checks whether the signature Sig_Bi selected in
step 406 matches any of the signatures it computed during step 403. If it matches, execution continues at step 409. If
it doesn't match, the tracking remote chunk offset and the length in bytes Len_Bi are added to the request list in step 408.
At step 409, the tracking offset is incremented by the length of the current chunk Len_Bi.

[0039] In step 410, the local device A tests whether all (signature, chunk length) pairs received in step 404 have been
processed. If not, execution continues at step 406. Otherwise, the chunk request list is suitably encoded in a compact
fashion, compressed, and sent to the remote device B at step 411.

[0040] The remote device B receives the compressed list of chunks at step 455, decompresses it, then compresses
and sends back the chunk data at step 456.

[0041] The local device receives and decompresses the requested chunk data at step 412. Using the local copy of
the object and the received chunk data, the local device reassembles a local copy of O_B at step 413.

[0042] FIG. 4B illustrates a detailed example for step 413 from FIG. 4A. Processing continues at step 414, where the
local device A initializes the reconstructed object to empty.

[0043] In step 415, the next (signature, chunk length) pair (Sig_Bi, Len_Bi) is selected for consideration from the list
received in step 404. In step 416, device A checks whether the signature Sig_Bi selected in step 415 matches any of the
signatures it computed during step 403.

[0044] If it matches, execution continues at step 417, where the corresponding local chunk is appended to the
reconstructed object. If it doesn't match, the received and decompressed remote chunk is appended to the reconstructed
object in step 418.

[0045] In step 419, the local device A tests whether all (signature, chunk length) pairs received in step 404 have been
processed. If not, execution continues at step 415. Otherwise, the reconstructed object is used to replace the old copy
of the object O_A on device A in step 420.
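
The reconstruction loop of steps 414 through 419 can be sketched as follows. SHA-256, the sample chunk contents, and the function name `reconstruct` are assumptions made for illustration; the received chunks are assumed to arrive in the order in which they were requested.

```python
import hashlib

def sig(chunk: bytes) -> bytes:
    return hashlib.sha256(chunk).digest()

def reconstruct(local_chunks, remote_list, received):
    """Rebuild the remote object in the order given by the remote signature list.

    local_chunks: chunks of the old local object
    remote_list:  (signature, length) pairs received in step 404
    received:     decompressed chunk data from step 412, in request order
    """
    local_by_sig = {sig(c): c for c in local_chunks}
    fetched = iter(received)
    out = bytearray()                    # step 414: start with an empty object
    for s, length in remote_list:        # steps 415 and 419: visit every pair in order
        if s in local_by_sig:            # step 416: is the signature known locally?
            piece = local_by_sig[s]      # step 417: append the matching local chunk
        else:
            piece = next(fetched)        # step 418: append the received remote chunk
        assert len(piece) == length      # sanity check against the advertised length
        out += piece
    return bytes(out)

# Illustrative data: device A holds three chunks and has fetched the two it was missing.
local = [b"alpha", b"bravo", b"delta"]
remote = [b"alpha", b"BRAVO!", b"charlie", b"delta"]
remote_list = [(sig(c), len(c)) for c in remote]
rebuilt = reconstruct(local, remote_list, received=[b"BRAVO!", b"charlie"])
print(rebuilt)  # b'alphaBRAVO!charliedelta'
```

The order of the output is dictated entirely by the remote signature list, so local chunks may be reused at positions different from where they sat in the old object.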

Example Recursive Signature Transfer Process Flow 

[0046] FIGS. 5A and 5B are diagrams illustrating process flows for recursive transfer of the signature and chunk length 
list in an example RDC procedure that is arranged according to at least one aspect of the present invention. The below 
described procedure may be applied to both the local and remote devices that are attempting to update commonly 
associated objects. 

[0047] The left hand side of FIG. 5A illustrates steps 501 - 513 that are operated on the local device A, while the right
hand side of FIG. 5A illustrates steps 551 - 556 that are operated on the remote device B. Steps 501 - 513 replace step
404 in FIG. 4A while steps 551 - 556 replace step 454 in FIG. 4A.

[0048] In steps 501 and 551, both the local device A and remote device B independently compute recursive fingerprints
of their signature and chunk length lists ((Sig_A1, Len_A1), ... (Sig_Ak, Len_Ak)) and ((Sig_B1, Len_B1), ... (Sig_Bn, Len_Bn)),
respectively, that had been computed in steps 402/403 and 452/453, respectively. In steps 502 and 552 the devices divide
their respective signature and chunk length lists into recursive chunks, and in steps 503 and 553 compute recursive
signatures (e.g., SHA) for each recursive chunk, respectively.

[0049] In step 554, device B sends the recursive signature and chunk length list computed in steps 552 and 553 to
device A, which receives this information in step 504.

[0050] In step 505, the local device A initializes the list of requested recursive chunks to the empty list, and initializes
the tracking remote recursive offset for the remote recursive chunks to 0. In step 506, the next (recursive signature,
recursive chunk length) pair (RSig_Bj, RLen_Bj) is selected for consideration from the list received in step 504. In step 507,
device A checks whether the recursive signature RSig_Bj selected in step 506 matches any of the recursive signatures
it computed during step 503. If it matches, execution continues at step 509. If it doesn't match, the tracking remote
recursive chunk offset and the length in bytes RLen_Bj are added to the request list in step 508. At step 509, the tracking
remote recursive offset is incremented by the length of the current recursive chunk RLen_Bj.

[0051] In step 510, the local device A tests whether all (recursive signature, recursive chunk length) pairs received in
step 504 have been processed. If not, execution continues at step 506. Otherwise, the recursive chunk request list is
compactly encoded, compressed, and sent to the remote device B at step 511.

[0052] The remote device B receives the compressed list of recursive chunks at step 555, decompresses the list, then
compresses and sends back the recursive chunk data at step 556.

[0053] The local device receives and decompresses the requested recursive chunk data at step 512. Using the local
copy of the signature and chunk length list ((Sig_A1,Len_A1), ... (Sig_Ak,Len_Ak)) and the received recursive chunk data, the
local device reassembles a local copy of the signature and chunk length list ((Sig_B1,Len_B1), ... (Sig_Bn,Len_Bn)) at step
513. Execution then continues at step 405 in FIG. 4A.

[0054] FIG. 5B illustrates a detailed example for step 513 from FIG. 5A. Processing continues at step 514, where the 
local device A initializes the list of remote signatures and chunk lengths, SIGCL, to the empty list. 
[0055] In step 515, the next (recursive signature, recursive chunk length) pair (RSig_Bj, RLen_Bj) is selected for
consideration from the list received in step 504. In step 516, device A checks whether the recursive signature RSig_Bj
selected in step 515 matches any of the recursive signatures it computed during step 503.

[0056] If it matches, execution continues at step 517, where device A appends the corresponding local recursive chunk
to SIGCL. If it doesn't match, the remote received recursive chunk is appended to SIGCL at step 518.

[0057] In step 519, the local device A tests whether all (recursive signature, recursive chunk length) pairs received in
step 504 have been processed. If not, execution continues at step 515. Otherwise, the local copy of the signature and
chunk length list ((Sig_B1,Len_B1), ... (Sig_Bn,Len_Bn)) is set to the value of SIGCL in step 520. Execution then continues
back to step 405 in FIG. 4A.
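
The request-list loop of steps 505-510 and the reassembly loop of steps 514-520 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: the function names, the choice of SHA-1 over raw bytes, and the direct slicing that stands in for device B's reply are all hypothetical.

```python
import hashlib

def rsig(chunk: bytes) -> bytes:
    """Recursive signature of one recursive chunk (SHA-1 is an assumption)."""
    return hashlib.sha1(chunk).digest()

def build_request_list(remote_pairs, local_rsigs):
    """Steps 505-510: collect (offset, length) of recursive chunks missing locally."""
    requests = []                                # step 505: empty request list
    offset = 0                                   # step 505: tracking remote offset
    for rsig_b, rlen_b in remote_pairs:          # step 506: next pair
        if rsig_b not in local_rsigs:            # step 507: any local match?
            requests.append((offset, rlen_b))    # step 508: request this chunk
        offset += rlen_b                         # step 509: advance the offset
    return requests

def reassemble(remote_pairs, local_by_rsig, received_by_offset):
    """Steps 514-520: rebuild device B's signature and chunk length list (SIGCL)."""
    sigcl = bytearray()                          # step 514: SIGCL starts empty
    offset = 0
    for rsig_b, rlen_b in remote_pairs:          # step 515: next pair
        if rsig_b in local_by_rsig:              # step 516: any local match?
            sigcl += local_by_rsig[rsig_b]       # step 517: reuse local chunk
        else:
            sigcl += received_by_offset[offset]  # step 518: use received chunk
        offset += rlen_b
    return bytes(sigcl)                          # step 520

# Hypothetical recursive chunks of the two serialized signature lists.
local_chunks = [b"sig-list-part-1", b"sig-list-part-2", b"sig-list-part-4"]
remote_chunks = [b"sig-list-part-1", b"sig-list-part-2",
                 b"sig-list-part-3", b"sig-list-part-4"]

local_by_rsig = {rsig(c): c for c in local_chunks}
remote_pairs = [(rsig(c), len(c)) for c in remote_chunks]

requests = build_request_list(remote_pairs, set(local_by_rsig))

# Device B would answer the request; here its serialized list is sliced directly.
remote_blob = b"".join(remote_chunks)
received = {off: remote_blob[off:off + length] for off, length in requests}

rebuilt = reassemble(remote_pairs, local_by_rsig, received)
```

In this toy run only the third recursive chunk crosses the wire; the other three are taken from device A's own list.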

[0058] The recursive signature and chunk length list may optionally be evaluated to determine if additional recursive
remote differential compression is necessary to minimize bandwidth utilization as previously described. The recursive
signature and chunk length list can be recursively compressed using the described chunking procedure by replacing
steps 504 and 554 with another instance of the RDC procedure, and so on, until the desired compression level is
achieved. After the recursive signature list is sufficiently compressed, the recursive signature list is returned for
transmission between the remote and local devices as previously described.

[0059] FIG. 6 is a diagram that graphically illustrates an example of recursive compression in an example RDC
sequence that is arranged in accordance with an example embodiment. For the example illustrated in FIG. 6, the original
object is 9.1 GB of data. A signature and chunk length list is compiled using a chunking procedure, where the signature
and chunk length list results in 3 million chunks (or a size of 42 MB). After a first recursive step, the signature list is divided
into 33 thousand chunks and reduced to a recursive signature and recursive chunk length list with size 33 KB. By
recursively compressing the signature list, bandwidth utilization for transferring the signature list is thus dramatically
reduced, from 42 MB to about 395 KB.

Example Object Updating 

[0060] FIG. 7 is a diagram illustrating the interaction of a client and server application using an example RDC procedure
that is arranged according to at least one aspect of the present invention. The original file on both the server and the
client contained text "The quick fox jumped over the lazy brown dog. The dog was so lazy that he didn't notice the fox
jumping over him."

[0061] At a subsequent time, the file on the server is updated to: "The quick fox jumped over the lazy brown dog. The
brown dog was so lazy that he didn't notice the fox jumping over him."

[0062] As described previously, the client periodically requests the file to be updated. The client and server both chunk
the object (the text) into chunks as illustrated. On the client, the chunks are: "The quick fox jumped", "over the lazy brown
dog.", "The dog was so lazy that he didn't notice", and "the fox jumping over him."; the client signature list is generated
as: SHA11, SHA12, SHA13, and SHA14. On the server, the chunks are: "The quick fox jumped", "over the lazy brown
dog.", "The brown dog was", "so lazy that he didn't notice", and "the fox jumping over him."; the server signature list is
generated as: SHA21, SHA22, SHA23, SHA24, and SHA25.

[0063] The server transmits the signature list (SHA21 - SHA25) using a recursive signature compression technique as
previously described. The client recognizes that the locally stored signature list (SHA11 - SHA14) does not match the
received signature list (SHA21 - SHA25), and requests the missing chunks 3 and 4 from the server. The server compresses
and transmits chunks 3 and 4 ("The brown dog was", and "so lazy that he didn't notice"). The client receives the
compressed chunks, decompresses them, and updates the file as illustrated in FIG. 7.
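
The exchange in this example can be replayed in a few lines. The sketch below takes the chunk boundaries exactly as listed in the text rather than computing them, uses SHA-1 from Python's hashlib as the signature function, and assumes the chunks are separated by single spaces; compression of the transferred chunks is left out.

```python
import hashlib

def signature(chunk: str) -> str:
    """Chunk signature; SHA-1 over UTF-8 bytes is an assumption of this sketch."""
    return hashlib.sha1(chunk.encode("utf-8")).hexdigest()

client_chunks = [
    "The quick fox jumped",
    "over the lazy brown dog.",
    "The dog was so lazy that he didn't notice",
    "the fox jumping over him.",
]
server_chunks = [
    "The quick fox jumped",
    "over the lazy brown dog.",
    "The brown dog was",
    "so lazy that he didn't notice",
    "the fox jumping over him.",
]

# Client side: index local chunks by signature.
client_by_sig = {signature(c): c for c in client_chunks}

# Server side: the signature list that is sent to the client.
server_sigs = [signature(c) for c in server_chunks]

# Client: 1-based numbers of the server chunks it cannot find locally.
missing = [i + 1 for i, s in enumerate(server_sigs) if s not in client_by_sig]

# Server: returns only the requested chunks.
transferred = {i: server_chunks[i - 1] for i in missing}

# Client: rebuild the updated file from local and transferred chunks.
rebuilt = [
    client_by_sig[s] if s in client_by_sig else transferred[i + 1]
    for i, s in enumerate(server_sigs)
]
updated_text = " ".join(rebuilt)
```

Three of the five server chunks are satisfied from the client's old copy, so only chunks 3 and 4 are transferred.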

Chunking Analysis 

[0064] The effectiveness of the basic RDC procedure described above may be increased by optimizing the chunking
procedures that are used to chunk the object data and/or chunk the signature and chunk length lists.

[0065] The basic RDC procedure has a network communication overhead cost that is identified by the sum of:

(S1) |Signatures and chunk lengths from B| = |O_B| * |SigLen| / C, where |O_B| is the size in bytes of Object O_B, SigLen
is the size in bytes of a (signature, chunk length) pair, and C is the expected average chunk size in bytes; and

(S2) Σ chunk_length, where (signature, chunk_length) ∈ Signatures from B,
and signature ∉ Signatures from A

[0066] The communication cost thus benefits from a large average chunk size and a large intersection between the
remote and local chunks. The choice of how objects are cut into chunks determines the quality of the protocol. The local
and remote device must agree, without prior communication, on where to cut an object. The following describes and
analyzes various methods for finding cuts.

[0067] The following characteristics are assumed to be known for the cutting algorithm:

1. Slack: The number of bytes required for chunks to reconcile between file differences. Consider sequences s1,
s2, and s3, and form the two sequences s1s3, s2s3 by concatenation. Generate the chunks for those two sequences,
Chunks1 and Chunks2. If Chunks1' and Chunks2' are the sums of the chunk lengths from Chunks1 and Chunks2,
respectively, until the first common suffix is reached, the slack in bytes is given by the following formula:

slack = Chunks1' - |s1| = Chunks2' - |s2|



2. Average chunk size C:

When Objects O_A and O_B have S segments in common with average size K, the number of chunks that can be
obtained locally on the client is given by:

S * ⌊(K - slack)/C⌋

and (S2) above rewrites to:

|O_A| - S * ⌊(K - slack)/C⌋

[0068] Thus, a chunking algorithm that minimizes slack will minimize the number of bytes sent over the wire. It is 
therefore advantageous to use chunking algorithms that minimize the expected slack. 
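
To make the effect of slack concrete, the two quantities above can be evaluated directly. The helper names and all numeric inputs below are hypothetical and chosen only for illustration; the slack factors 0.85 and 0.66 are the estimates this description derives for the prior art and for chunking at local maxima.

```python
import math

def signature_list_cost(object_size: int, sig_len: int, avg_chunk: int) -> float:
    """(S1): |O_B| * |SigLen| / C, the bytes spent on signatures and chunk lengths."""
    return object_size * sig_len / avg_chunk

def locally_available_chunks(segments: int, avg_segment: float,
                             slack: float, avg_chunk: float) -> int:
    """S * floor((K - slack) / C), the chunks that need not be transferred."""
    return segments * math.floor((avg_segment - slack) / avg_chunk)

C = 2048          # assumed average chunk size in bytes
S, K = 10, 5600   # assumed: 10 common segments of 5600 bytes on average

overhead = signature_list_cost(1_048_576, 24, C)   # 1 MiB object, 24-byte pairs

reused_high_slack = locally_available_chunks(S, K, 0.85 * C, C)
reused_low_slack = locally_available_chunks(S, K, 0.66 * C, C)
```

With these particular inputs the lower slack doubles the number of chunks reused from the local copy (20 instead of 10); other inputs can show a smaller gain or none, since the floor only changes when a whole extra chunk fits into a segment.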

Fingerprinting Functions 

[0069] All chunking algorithms use a fingerprinting function, or hash, that depends on a small window, that is, a limited
sequence of bytes. The execution time of the hash algorithms used for chunking is independent of the hash window
size when those algorithms are amenable to finite differencing (strength reduction) optimizations. Thus, for a hash
window of size k it should be easy (require only a constant number of steps) to compute the hash #[b_1, ..., b_(k-1), b_k]
using b_0, b_k, and #[b_0, b_1, ..., b_(k-1)] only. Various hashing functions can be employed such as hash functions using Rabin
polynomials, as well as other hash functions that appear computationally more efficient based on tables of pre-computed
random numbers.

[0070] In one example, a 32 bit Adler hash based on the rolling checksum can be used as the hashing function for
fingerprinting. This procedure provides a reasonably good random hash function by using a fixed table with 256 entries,
each a precomputed 16 bit random number. The table is used to convert fingerprinted bytes into a random 16 bit number.
The 32 bit hash is split into two 16 bit numbers sum1 and sum2, which are updated given the procedure:

    sum1 += table[b_k] - table[b_0]

    sum2 += sum1 - k * table[b_0]
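
Reading the two update rules as sum1 += table[b_k] - table[b_0] followed by sum2 += sum1 - k * table[b_0], a rolling implementation can be checked against a from-scratch computation. The sketch below is an illustration under assumptions: the window size, the seeded random table, and the position weighting used for sum2 (each byte weighted by its distance from the end of the window, as in rsync-style checksums) are choices made here, not details given in the text.

```python
import random

K = 16                                              # window size (assumed)
_rng = random.Random(0)
TABLE = [_rng.getrandbits(16) for _ in range(256)]  # 256 fixed 16-bit values
MASK16 = 0xFFFF

def direct_sums(window: bytes):
    """Compute sum1 and sum2 from scratch over one window."""
    k = len(window)
    sum1 = sum(TABLE[b] for b in window) & MASK16
    # Assumed weighting: the oldest byte has weight k, the newest has weight 1.
    sum2 = sum((k - i) * TABLE[b] for i, b in enumerate(window)) & MASK16
    return sum1, sum2

def roll(sum1: int, sum2: int, b_out: int, b_in: int, k: int = K):
    """Slide the window by one byte: drop b_out (b_0) and take in b_in (b_k)."""
    sum1 = (sum1 + TABLE[b_in] - TABLE[b_out]) & MASK16
    sum2 = (sum2 + sum1 - k * TABLE[b_out]) & MASK16   # uses the updated sum1
    return sum1, sum2

def rolling_hashes(data: bytes, k: int = K):
    """Yield the 32-bit hash of every k-byte window of data."""
    sum1, sum2 = direct_sums(data[:k])
    yield (sum2 << 16) | sum1
    for i in range(k, len(data)):
        sum1, sum2 = roll(sum1, sum2, data[i - k], data[i], k)
        yield (sum2 << 16) | sum1
```

Each step costs a constant number of operations regardless of K, which is the finite-differencing property described for fingerprinting functions.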

[0071] In another example, a 64 bit random hash with cyclic shifting may be used as the hashing function for
fingerprinting. The period of a cyclic shift is bounded by the size of the hash value. Thus, using a 64 bit hash value sets the
period of the hash to 64. The procedure for updating the hash is given as:

    hash = hash ^ ((table[b_0] << l) | (table[b_0] >> u)) ^ table[b_k];

    hash = (hash << 1) | (hash >> 63);

where l = k % 64 and u = 64 - l
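
The cyclic-shift update can be checked the same way. In the sketch below the table contents, the helper names, and the from-scratch definition (the byte at window index i contributes its table entry rotated left by k - i) are assumptions chosen so that the two-line update above reproduces it; Python integers are unbounded, so every result is masked to 64 bits.

```python
import random

MASK64 = (1 << 64) - 1
_rng = random.Random(1)
TABLE = [_rng.getrandbits(64) for _ in range(256)]   # assumed random table

def rotl(x: int, r: int) -> int:
    """Rotate a 64-bit value left by r bits (r is reduced modulo 64)."""
    r %= 64
    return ((x << r) | (x >> (64 - r))) & MASK64

def direct_hash(window: bytes) -> int:
    """From-scratch hash: XOR of table entries, each rotated by its age."""
    k = len(window)
    h = 0
    for i, b in enumerate(window):
        h ^= rotl(TABLE[b], k - i)
    return h

def roll(h: int, b_out: int, b_in: int, k: int) -> int:
    """Remove b_out (b_0), add b_in (b_k), then rotate the whole hash by one."""
    h ^= rotl(TABLE[b_out], k) ^ TABLE[b_in]   # rotl reduces k modulo 64 (l)
    return rotl(h, 1)

def rolling_hashes(data: bytes, k: int):
    """Yield the 64-bit hash of every k-byte window of data."""
    h = direct_hash(data[:k])
    yield h
    for i in range(k, len(data)):
        h = roll(h, data[i - k], data[i], k)
        yield h
```

A window size that is a multiple of 64 makes l equal to 0; the rotate helper handles that case, whereas a literal shift by 64 bits would be undefined in C.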

[0072] In still another example, other shifting methods may be employed to provide fingerprinting. Straightforward
cyclic shifting produces a period of limited length, and is bounded by the size of the hash value. Other permutations
have longer periods. For instance, the permutation given by the cycles (1 2 3 0) (5 6 7 8 9 10 11 12 13 14 4) (16 17 18
19 20 21 15) (23 24 25 26 22) (28 29 27) (31 30) has a period of length 4*3*5*7*11 = 4620. The single application of
this example permutation can be computed using a right shift followed by operations that patch up the positions at the
beginning of each interval.
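
The stated period follows from the cycle lengths (4, 11, 7, 5, 3 and 2), whose least common multiple is 4620. The short check below computes it both ways; the function names are illustrative, and `math.lcm` requires Python 3.9 or later.

```python
from math import lcm

# The six cycles of the example permutation on bit positions 0..31.
CYCLES = [
    (1, 2, 3, 0),
    (5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 4),
    (16, 17, 18, 19, 20, 21, 15),
    (23, 24, 25, 26, 22),
    (28, 29, 27),
    (31, 30),
]

def period_from_cycle_lengths(cycles) -> int:
    """The period of a permutation is the LCM of its cycle lengths."""
    return lcm(*(len(c) for c in cycles))

def period_by_iteration(cycles) -> int:
    """Apply the permutation repeatedly until every position is back home."""
    perm = {}
    for cycle in cycles:
        for i, src in enumerate(cycle):
            perm[src] = cycle[(i + 1) % len(cycle)]
    positions = sorted(perm)
    current = {p: p for p in positions}
    steps = 0
    while True:
        current = {p: perm[current[p]] for p in positions}
        steps += 1
        if all(current[p] == p for p in positions):
            return steps
```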

Analysis of previous art for chunking at pre-determined patterns 

[0073] Previous chunking methods are determined by computing a fingerprinting hash with a pre-determined window 
size k (= 48), and identifying cut points based on whether a subset of the hash bits match a pre-determined pattern. 
With random hash values, this pattern may as well be 0, and the relevant subset may as well be a prefix of the hash. In 
basic instructions, this translates to a predicate of the form: 



    CutPoint(hash) ≡ 0 == (hash & ((1 << c) - 1)),

where c is the number of bits that are to be matched against.

[0074] Since the probability for a match given a random hash function is 2^(-c), an average chunk size C = 2^c results.
However, neither the minimal, nor the maximal chunk size is determined by this procedure. If a minimal chunk length of
m is imposed, then the average chunk size is:

C = m + 2^c

[0075] A rough estimate of the expected slack is obtained by considering streams s1s3 and s2s3. Cut points in s1 and
s2 may appear at arbitrary places. Since the average chunk length is C = m + 2^c, about (2^c/C)^2 of the last cut-points in
s1 and s2 will be beyond distance m. They will contribute to slack at around 2^c. The remaining 1 - (2^c/C)^2 contribute
with slack of length about C. The expected slack, as a fraction of C, will then be around
(2^c/C)^3 + (1 - (2^c/C)^2)*(C/C) = (2^c/C)^3 + 1 - (2^c/C)^2,
which has global minimum for m = 2^(c-1), with a value of about 23/27 = 0.85. A more precise analysis gives a somewhat
lower estimate for the remaining 1 - (2^c/C)^2 fraction, but will also need to compensate for cuts within distance m inside
s3, which contributes to a higher estimate.

Thus, the expected slack for the prior art is approximately 0.85 * C.

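
The figure 23/27 can be checked with a short calculation. Writing r = 2^c / C and taking the estimate to be r^3 + 1 - r^2 (slack expressed as a fraction of C), a reading that is consistent with the stated minimum:

```latex
\begin{aligned}
f(r)  &= r^{3} - r^{2} + 1, \qquad r = \frac{2^{c}}{C} = \frac{2^{c}}{m + 2^{c}} \in (0, 1] \\
f'(r) &= 3r^{2} - 2r = 0 \;\Longrightarrow\; r = \tfrac{2}{3} \\
f\!\left(\tfrac{2}{3}\right) &= \tfrac{8}{27} - \tfrac{12}{27} + 1 = \tfrac{23}{27} \approx 0.85 \\
r = \tfrac{2}{3} \;&\Longrightarrow\; m + 2^{c} = \tfrac{3}{2}\, 2^{c} \;\Longrightarrow\; m = 2^{c-1}
\end{aligned}
```

Since f''(2/3) = 2 > 0, this stationary point is the minimum on the interval.
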
Chunking at Filters (New Art) 

[0076] Chunking at filters is based on fixing a filter, which is a sequence of patterns of length m, and matching the
sequence of fingerprinting hashes against the filter. When the filter does not allow a sequence of hashes to match both
a prefix and a suffix of the filter it can be inferred that the minimal distance between any two matches must be at least
m. An example filter may be obtained from the CutPoint predicate used in the previous art, by setting the first m - 1
patterns to

    0 != (hash & ((1 << c) - 1))

and the last pattern to:

    0 == (hash & ((1 << c) - 1)).

[0077] The probability for matching this filter is given by (1 - p)^(m-1) * p, where p is 2^(-c). One may compute that the expected
chunk length is given by the inverse of the probability for matching a filter (it is required that the filter not allow a sequence
to match both a prefix and suffix), thus the expected length of the example filter is (1 - p)^(-m+1) * p^(-1). This length is minimized
when setting p := 1/m, and it turns out to be around (e * m). The average slack hovers around 0.8, as can be verified by
those skilled in the art. An alternative embodiment of this method uses a pattern that works directly with the raw input
and does not use rolling hashes.
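
The example filter (m - 1 non-matching hashes followed by one matching hash) can be sketched as a single pass over the hash stream. The function names, the seeded random hashes that stand in for a real fingerprinting function, and the parameter values are assumptions for illustration.

```python
import random

def matches(h: int, c: int) -> bool:
    """Prior-art CutPoint predicate: the low c bits of the hash are all zero."""
    return (h & ((1 << c) - 1)) == 0

def filter_cut_points(hashes, c: int, m: int):
    """Positions whose hash matches while the m - 1 preceding hashes do not."""
    cuts = []
    run = 0                        # consecutive non-matching hashes seen so far
    for pos, h in enumerate(hashes):
        if matches(h, c):
            if run >= m - 1:       # the whole filter has matched
                cuts.append(pos)
            run = 0
        else:
            run += 1
    return cuts

def expected_chunk_length(p: float, m: int) -> float:
    """Inverse of the match probability (1 - p)**(m - 1) * p."""
    return (1 - p) ** (-(m - 1)) / p

# Demo with p = 2**-4 = 1/16 and m = 16, i.e. the setting p := 1/m.
_rng = random.Random(0)
hashes = [_rng.getrandbits(32) for _ in range(20000)]
M, C_BITS = 16, 4
cuts = filter_cut_points(hashes, C_BITS, M)
gaps = [b - a for a, b in zip(cuts, cuts[1:])]
```

No two cut points produced this way are closer than m positions, so the minimum chunk size needs no separate parameter, and `expected_chunk_length(1 / m, m)` approaches e * m as m grows.
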
Chunking at Local Maxima (New Art)

[0078] Chunking at Local Maxima is based on choosing as cut points positions that are maximal within a bounded
horizon. In the following, we shall use h for the value of the horizon. We say that the hash at position offset is an h-local
maximum if the hash values at offsets offset-h, ..., offset-1, as well as offset+1, ..., offset+h are all smaller than the hash
value at offset. In other words, all positions h steps to the left and h steps to the right have lesser hash values. Those
skilled in the art will recognize that local maxima may be replaced by local minima or any other metric based comparison
(such as "closest to the median hash value").
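
The definition of an h-local maximum translates directly into a brute-force check. The sketch below is deliberately simple, costing on the order of n*h comparisons, and is not the constant-time-per-offset CutPoint procedure of FIGS. 8 and 9; the function name, the choice to skip offsets within h of either end of the stream, and the random demo data are assumptions.

```python
import random

def local_maxima_cut_points(hashes, h: int):
    """Offsets whose hash is strictly greater than the h hashes on either side."""
    n = len(hashes)
    cuts = []
    for p in range(h, n - h):      # assumed: offsets near the ends are skipped
        left = hashes[p - h:p]
        right = hashes[p + 1:p + h + 1]
        if all(x < hashes[p] for x in left) and all(x < hashes[p] for x in right):
            cuts.append(p)
    return cuts

# Small worked example with horizon 2: only offset 5 (value 9) qualifies.
small = local_maxima_cut_points([3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5], 2)

# Larger demo on random hashes; cut points are always more than h apart.
_rng = random.Random(0)
H = 32
stream = [_rng.getrandbits(32) for _ in range(20000)]
cuts = local_maxima_cut_points(stream, H)
gaps = [b - a for a, b in zip(cuts, cuts[1:])]
```

Two h-local maxima can never lie within h positions of each other, because each would have to be strictly greater than the other; this is where the built-in minimum chunk size comes from.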

[0079] The set of local maxima for an object of size n may be computed in time bounded by 2·n operations such that
the cost of computing the set of local maxima is close to or the same as the cost of computing the cut-points based on
independent chunking. Chunks generated using local maxima always have a minimal size corresponding to h, with an
average size of approximately 2h+1. A CutPoint procedure is illustrated in FIGS. 8 and 9, and is described as follows below:

1. Allocate an array M of length h whose entries are initialized with the record {isMax=false, hash=0, offset=0}. The
first field in each entry (isMax) indicates whether a candidate can be a local maximum. The second field (hash)
indicates the hash value associated with that entry, and is initialized to 0 (or alternatively, to a maximal possible
hash value). The last field (offset) indicates the absolute offset in bytes of the candidate within the fingerprinted
object.

2. Initialize offsets min and max into the array M to 0. These variables point to the first and last elements of the array
that are currently being used.

3. CutPoint(hash, offset) starts at step 800 in FIG. 8 and is invoked at each offset of the object to update M and
return a result indicating whether a particular offset is a cut-point.
The procedure starts by setting result = false at step 801.
At step 803, the procedure checks whether M[max].offset + h + 1 = offset. If this condition is true, execution continues
at step 804 where the following assignments are performed: result is set to M[max].isMax, and max is set to
(max-1) % h. Execution then continues at step 805. If the condition at step 803 is false, execution continues at step 805.
At step 805, the procedure checks whether M[min].hash > hash. If the condition is true, execution continues at step
806, where min is set to (min-1) % h. Execution then continues at step 807, where M[min] is set to {isMax = false,
hash = hash, offset = offset}, and to step 811, where the computed result is returned.
If the condition at step 805 is false, execution continues to step 808, where the procedure checks whether
M[min].hash = hash. If this condition is true, execution continues at step 807.
If the condition at step 808 is false, execution continues at step 809, where the procedure checks whether min =
max. If this condition is true, execution continues at step 810, where M[min] is set to {isMax = true, hash = hash,
offset = offset}. Execution then continues at step 811, where the computed result is returned.
If the condition at step 809 is false, execution continues at step 811, where min is set to (min+1) % h. Execution
then continues back at step 805.

4. When CutPoint(hash, offset) returns true, it will be the case that the offset at position offset-h-1 is a new cut-point.

Analysis of Local Maximum Procedure

[0080] An object with n bytes is processed by calling CutPoint n times such that at most n entries are inserted for a
given object. One entry is removed each time the loop starting at step 805 is repeated such that there are no more than
n entries to delete. Thus, the processing loop may be entered once for every entry and the combined number of repetitions
may be at most n. This implies that the average number of steps within the loop at each call to CutPoint is slightly less
than 2, and the number of steps to compute cut points is independent of h.

[0081] Since the hash values from the elements form a descending chain between min and max, we will see that the
average distance between min and max (|min - max| % h) is given by the natural logarithm of h. Offsets not included
between two adjacent entries in M have hash values that are less than or equal to the two entries. The average length
of such chains is given by the recurrence equation f(n) = 1 + (1/n) * Σ_{k<n} f(k). The average length of the longest descending
chain on an interval of length n is 1 greater than the average length of the longest descending chain starting from the
position of the largest element, where the largest element may be found at arbitrary positions with a probability of 1/n.
The recurrence relation has a solution corresponding to the harmonic number H_n = 1 + 1/2 + 1/3 + 1/4 + ... + 1/n, which
can be validated by substituting H_n into the equation and performing induction on n. H_n is proportional to the natural
logarithm of n. Thus, although array M is allocated with size h, only a small fraction of size ln(h) is ever used at any one time.
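
The claim that the recurrence is solved by the harmonic numbers can be checked numerically with exact rational arithmetic. The base case f(0) = 0 and the function names are assumptions of this sketch.

```python
from fractions import Fraction

def chain_lengths(n_max: int):
    """f(0) = 0 and f(n) = 1 + (1/n) * sum(f(k) for k < n), computed exactly."""
    f = [Fraction(0)]
    running = Fraction(0)            # holds f(0) + ... + f(n-1)
    for n in range(1, n_max + 1):
        f.append(1 + running / n)
        running += f[-1]
    return f

def harmonic(n: int) -> Fraction:
    """H_n = 1 + 1/2 + ... + 1/n."""
    return sum((Fraction(1, k) for k in range(1, n + 1)), Fraction(0))

f = chain_lengths(64)
agrees = all(f[n] == harmonic(n) for n in range(1, 65))
```

For a horizon of h = 64 the expected chain length is H_64, a little under 5 entries, which illustrates how small a part of the array M is in use at any one time.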

[0082] Computing min and max with modulus h permits arbitrary growth of the used intervals of M as long as the
distance between the numbers remains within h.

[0083] The choice of initial values for M implies that cut-points may be generated within the first h offsets. The algorithm
can be adapted to avoid cut-points at these first h offsets.

[0084] The expected size of the chunks generated by this procedure is around 2h+1. We obtain this number from the
probability that a given position is a cut-point. Suppose the hash has m different possible values. Then the probability
is determined by:

$$\sum_{0 \le k < m} \frac{1}{m} \left(\frac{k}{m}\right)^{2h}.$$

[0085] Approximating using integration, $\int_{0 \le x < m} \frac{1}{m} \left(\frac{x}{m}\right)^{2h} dx = \frac{1}{2h+1}$ indicates the probability when m is sufficiently
large.

[0086] The probability can be computed more precisely by first simplifying the sum to:

$$\left(\frac{1}{m}\right)^{2h+1} \sum_{0 \le k < m} k^{2h},$$

which using Bernoulli numbers expands to:

$$\frac{1}{2h+1} \sum_{0 \le k \le 2h} \frac{(2h+1)!}{k!\,(2h+1-k)!}\, B_k\, m^{-k}.$$

The only odd Bernoulli number that is non-zero is B_1, which has a corresponding value of -1/2. The even Bernoulli
numbers satisfy the equation:

$$H_\infty(2n) = (-1)^{n-1}\, \frac{2^{2n-1}\, \pi^{2n}\, B_{2n}}{(2n)!}$$

[0087] The left hand side represents the infinite sum 1 + (1/2)^(2n) + (1/3)^(2n) + ..., which for even moderate values of n
is very close to 1.

When m is much larger than h, all of the terms, except for the first, can be ignored, as we saw by integration. They are
given by a constant between 0 and 1 multiplied by a term proportional to h^(k-1) / m^k. The first term (where B_0 = 1) simplifies
to 1/(2h+1) (the second term is -1/(2m), the third is h/(6m^2)).
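
The leading terms quoted above can be compared with the exact sum using rational arithmetic. The function names and the sample values h = 3 and m = 1000 are illustrative only; a real hash has far more than 1000 possible values, which makes the correction terms smaller still.

```python
from fractions import Fraction

def cut_probability(h: int, m: int) -> Fraction:
    """Exact value of the sum over 0 <= k < m of (1/m) * (k/m)**(2h)."""
    return sum(Fraction(k, m) ** (2 * h) for k in range(m)) / m

def three_term_estimate(h: int, m: int) -> Fraction:
    """First three terms of the expansion: 1/(2h+1) - 1/(2m) + h/(6 m^2)."""
    return Fraction(1, 2 * h + 1) - Fraction(1, 2 * m) + Fraction(h, 6 * m * m)

h, m = 3, 1000
exact = cut_probability(h, m)
estimate = three_term_estimate(h, m)
error = abs(exact - estimate)
```

For these values the exact probability differs from 1/(2h+1) = 1/7 by about 1/(2m), and the three-term estimate is accurate to better than one part in 10^12.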

[0088] For a rough estimate of the expected slack consider streams s1s3 and s2s3. The last cut points inside s1 and
s2 may appear at arbitrary places. Since the average chunk length is about 2h+1, about 1/4 of the last cut-points will
be within distance h in both s1 and s2. They will contribute to cut-points at around 7/8 h. In another 1/2 of the cases, one
cut-point will be within distance h, the other beyond distance h. These contribute with cut-points around 3/4 h. The remaining
1/4 of the last cut-points in s1 and s2 will be in distance larger than h. The expected slack will therefore be around
1/4 * 7/8 + 1/2 * 3/4 + 1/4 * 1/4 = 0.66.

[0089] Thus, the expected slack for our independent chunking approach is 0.66 * C, which is an improvement over
the prior art (0.85 * C).

[0090] There is an alternate way of identifying cut-points that requires executing on average fewer instructions while
using space at most proportional to h, or on average ln h. The procedure above inserts entries for every position 0..n-1
in a stream of length n. The basic idea in the alternate procedure is to only update when encountering elements of an
ascending chain within intervals of length h. We observed that there will on average only be ln h such updates per interval.
Furthermore, by comparing the local maxima in two consecutive intervals of length h one can determine whether each
of the two local maxima may also be an h-local maximum. There is one peculiarity with the alternate procedure: it requires
computing the ascending chains by traversing the stream in blocks of size h, where each block gets traversed in reverse direction.

[0091] In the alternate procedure (see FIGS. 10 and 11), we assume for simplicity that a stream of hashes is given
as a sequence. The subroutine CutPoint gets called for each subsequence of length h (expanded to "horizon" in the
Figures). It returns zero or one offsets which are determined to be cut-points. Only ln(h) of the calls to Insert will pass
the first test.

[0092] Insertion into A is achieved by testing the hash value at the offset against the largest entry in A so far.

[0093] The loop that updates both A[k] and B[k].isMax can be optimized such that on average only one test is
performed in the loop body. The case B[l].hash <= A[k].hash and B[l].isMax is handled in two loops: the first checks
the hash value against B[l].hash until it is not less, the second updates A[k]. The other case can be handled using a
loop that only updates A[k] followed by an update to B[l].isMax.

[0094] Each call to CutPoint requires on average ln h memory writes to A, and with loop hoisting h + ln h comparisons
related to finding maxima. The last update to A[k].isMax may be performed by binary search or by traversing B starting
from index 0 in on average at most log ln h steps. Each call to CutPoint also requires re-computing the rolling hash at
the last position in the window being updated. This takes as many steps as the size of the rolling hash window.

Observed Benefits of the Improved Chunking Algorithms 

[0095] The minimal chunk size is built into both the local maxima and the filter methods described above. The
conventional implementations require that the minimal chunk size is supplied separately with an extra parameter.

[0096] The local max (or mathematical) based methods produce measurably better slack estimates, which translates
to further compression over the network. The filter method also produces better slack performance than the conventional
methods.

[0097] Both of the new methods have a locality property of cut points. All cut points inside s3 that are beyond horizon
will be cut points for both streams s1s3 and s2s3. (In other words, consider stream s1s3. If p is a position > |s1|+horizon
and p is a cut point in s1s3, then it is also a cut point in s2s3. The same property holds in the other direction (symmetrically):
if p is a cut point in s2s3, then it is also a cut point in s1s3.) This is not the case for the conventional methods, where
the requirement that cuts be beyond some minimal chunk size may interfere adversely.

Alternative Mathematical functions 

[0098] Although the above-described chunking procedures describe a means for locating cut-points using a local
maxima calculation, the present invention is not so limited. Any mathematical function can be arranged to examine
potential cut-points. Each potential cut-point is evaluated by evaluating hash values that are located within the horizon
window about a considered cut-point. The evaluation of the hash values is accomplished by the mathematical function,
which may include at least one of locating a maximum value within the horizon, locating a minimum value within the
horizon, evaluating a difference between hash values, evaluating a difference of hash values and comparing the result
against an arbitrary constant, as well as some other mathematical or statistical function.

[0099] The particular mathematical function described previously for local maxima is a binary predicate "_ > _". For
the case where p is an offset in the object, p is chosen as a cut-point if hash_p > hash_k for all k where p-horizon ≤ k <
p, or p < k ≤ p+horizon. However, the binary predicate > can be replaced with any other mathematical function without
deviating from the spirit of the invention.

Finding Candidate Objects for Remote Differential Compression 

[0100] The effectiveness of the basic RDC procedure described above may be increased by finding candidate objects
on the receiver, for signature and chunk reuse during steps 4 and 8 of the RDC algorithm, respectively. The algorithm
helps Device A identify a small subset of objects denoted by: O_A1, O_A2, ..., O_An that are similar to the object O_B that
needs to be transferred from Device B using the RDC algorithm. O_A1, O_A2, ..., O_An are part of the objects that are already
stored on Device A.

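[0101] The similarity between two objects O_B and O_A is measured in terms of the number of distinct chunks that the
two objects share divided by the total number of distinct chunks in the first object. Thus if Chunks(O_B) and Chunks(O_A)
are the sets of chunks computed for O_B and O_A of the RDC algorithm, respectively, then, using the notation |X| to denote
the cardinality, or number of elements, of set X:

$$\mathrm{Similarity}(O_B, O_A) = \frac{\left|\{\mathrm{Chunk}_B \mid \mathrm{Chunk}_B \in \mathrm{Chunks}(O_B) \wedge \exists\, \mathrm{Chunk}_A \in \mathrm{Chunks}(O_A) : \mathrm{Chunk}_B = \mathrm{Chunk}_A\}\right|}{\left|\mathrm{Chunks}(O_B)\right|}$$
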
[0102] As a proxy for chunk equality, the equality on the signatures of the chunks is used. This is highly accurate if
the signatures are computed using a cryptographically secure hash function (such as SHA-1 or MD5), given that the
probability of a hash collision is extremely low. Thus, if Signatures(O_B) and Signatures(O_A) are the sets of chunk signatures
computed for O_B and O_A in the chunking portion of the RDC algorithm, then:

$$\mathrm{Similarity}(O_B, O_A) \approx \frac{\left|\{\mathrm{Sig}_B \mid \mathrm{Sig}_B \in \mathrm{Signatures}(O_B) \wedge \exists\, \mathrm{Sig}_A \in \mathrm{Signatures}(O_A) : \mathrm{Sig}_B = \mathrm{Sig}_A\}\right|}{\left|\mathrm{Signatures}(O_B)\right|}$$

[0103] Given an object O_B and the set of objects Objects_A that are stored on Device A, the members of Objects_A that
have a degree of similarity with O_B which exceeds a given threshold s are identified. A typical value for s may be s =
0.5 (50% similarity), i.e. we are interested in objects that have at least half of their chunks in common with O_B. The
value for s, however, may be set at any value that makes sense for the application. For example, s could be set between
0.01 and 1.0 (1% similar to 100% similar). This set of objects is defined as:

$$\mathrm{Similar}(O_B, \mathrm{Objects}_A, s) = \{\, O_A \mid O_A \in \mathrm{Objects}_A \wedge \mathrm{Similarity}(O_B, O_A) \ge s \,\}$$

[0104] The set of objects O_A1, O_A2, ..., O_An is computed as a subset of Similar(O_B, Objects_A, s) by taking the best n
matches.
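
The similarity measure and the selection of the best n matches can be sketched with plain set operations. In this illustration signatures are short strings, the object names are made up, the threshold test is taken to be "greater than or equal to s", and ties are broken by name; all of these are assumptions rather than details fixed by the text.

```python
def similarity(sigs_b: set, sigs_a: set) -> float:
    """Share of O_B's distinct chunk signatures that also occur in O_A."""
    if not sigs_b:
        return 0.0
    return len(sigs_b & sigs_a) / len(sigs_b)

def similar(sigs_b: set, objects_a: dict, s: float, n: int) -> list:
    """Names of at most n objects on device A whose similarity to O_B is >= s."""
    scored = [(similarity(sigs_b, sigs), name) for name, sigs in objects_a.items()]
    kept = [(score, name) for score, name in scored if score >= s]
    kept.sort(key=lambda item: (-item[0], item[1]))   # best match first
    return [name for _, name in kept[:n]]

# Hypothetical signature sets.
o_b = {"s1", "s2", "s3", "s4"}
objects_a = {
    "old_v1": {"s1", "s2", "s3", "x"},   # shares 3 of O_B's 4 signatures
    "old_v2": {"s1", "s2", "y"},         # shares 2 of 4
    "other": {"s4", "z", "w"},           # shares 1 of 4
}

candidates = similar(o_b, objects_a, s=0.5, n=2)
best_only = similar(o_b, objects_a, s=0.5, n=1)
```

The measure is not symmetric: it is normalized by the number of distinct chunks in O_B, so a large local object that contains all of O_B's chunks scores 1.0 regardless of its other content.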

[0105] The basic RDC algorithm described above is modified as follows to identify and use the set of similar objects
O_A1, O_A2, ..., O_An.


[0106] FIGURE 12 illustrates an RDC algorithm modified to find and use candidate objects, in accordance with aspects
of the invention. The protocol for finding and using candidate objects on Device A and then transferring the updated object
O_B from device B to device A is described. A similar protocol may be used to transfer an object from device A to device
B, and the transfer can be initiated at the behest of either device A or device B without significantly changing the protocol
described below.

1. Device A sends device B a request to transfer Object O_B using the RDC protocol.

1.5. Device B sends Device A a set of traits of Object O_B, Traits(O_B). Generally, the traits are a compact representation
of the characteristics relating to object O_B. As will be described later, Device B may cache the traits for O_B so that
it does not need to recompute them prior to sending them to Device A.

1.6. Device A uses Traits(O_B) to identify O_A1, O_A2, ..., O_An, a subset of the objects that it already stores, that are
similar to Object O_B. This determination is made in a probabilistic manner.

2. Device A partitions the identified Objects O_A1, O_A2, ..., O_An into chunks. The partitioning occurs in a data-dependent
fashion, by using a fingerprinting function that is computed at every byte position of the objects. A chunk boundary
is determined at positions for which the fingerprinting function satisfies a given condition. Following the partitioning
into chunks, Device A computes a signature Sig_Aik for each chunk k of each Object O_Ai.

3. Using a similar approach as in step 2, Device B partitions Object O_B into chunks, and computes the signatures
Sig_Bj for each of the chunks. The partitioning algorithm used in step 3 must match the one in step 2 above.

4. Device B sends a list of chunk signatures (Sig_B1 ... Sig_Bn) to Device A. This list provides the basis for Device A being
able to reconstruct Object O_B. In addition to the chunk signatures Sig_Bj, information will be sent about the offset and
length of each chunk in Object O_B.

5. As Device A receives the chunk signatures from Device B, it compares the received signatures against the set
of signatures (Sig_A11, ... Sig_A1a, ... , Sig_An1, ... Sig_Anl) that it has computed in step 2. As part of this comparison,
Device A records every distinct signature value it received from Device B that does not match one of its own signatures
Sig_Aik computed on the chunks of Objects O_A1, O_A2, ..., O_An.

6. Device A sends a request to Device B for all the chunks whose signatures were received in the previous step
from Device B, but which did not have a matching signature on Device A. The chunks are requested by offset and
length in Object O_B, based on corresponding information that was sent in step 4.

7. Device B sends the content associated with all the requested chunks to device A.

8. Device A reconstructs Object O_B by using the chunks received in step 7 from Device B, as well as its own chunks
of objects O_A1, O_A2, ..., O_An that matched signatures sent by Device B in step 4. After this reconstruction step is
complete, Device A may now add the reconstructed copy of Object O_B to its already stored objects.

[0107] To minimize network traffic and CPU overhead, Traits(O_B) should be very small and the determination of the
set of similar objects O_A1, O_A2, ..., O_An should be performed with very few operations on Device A.







EP 1 641 219 A2 



Computing the set of traits for an object 

[0108] The set of traits for an object O, Traits(O), is computed based on the chunk signatures computed for O, as
described for steps 2 or 3 of the RDC algorithm, respectively.

[0109] FIGURES 13 and 14 show a process and an example of a trait computation, in accordance with aspects of the
invention.

[0110] The algorithm for identifying similar objects has four main parameters (q, b, t, x) that are summarized below.

q : Shingle size
b : Number of bits per trait
t : Number of traits per object
x : Minimum number of matching traits

[0111] The following steps are used to compute the traits for object O, Traits(O).

1. At block 1310, the chunk signatures of O, Sig_1 ... Sig_n, are grouped together into overlapping shingles of size q,
where every shingle comprises q chunk signatures, with the exception of the last q-1 shingles, which will contain
fewer than q signatures. Other groupings (discontiguous subsets, disjoint subsets, etc.) are possible, but it is
practically useful that inserting an extra signature causes all of the previously considered subsets to still be considered.

2. At block 1320, for each shingle 1 ... n, a shingle signature Shingle_1 ... Shingle_n is computed by concatenating the
q chunk signatures forming the shingle. For the case where q = 1, Shingle_1 = Sig_1, ..., Shingle_n = Sig_n.

3. At block 1330, the shingle set {Shingle_1 ... Shingle_n} is mapped into t image sets through the application of t hash
functions H_1 ... H_t. This generates t image sets, each containing n elements:

IS_1 = {H_1(Shingle_1), H_1(Shingle_2), ..., H_1(Shingle_n)}
...
IS_t = {H_t(Shingle_1), H_t(Shingle_2), ..., H_t(Shingle_n)}

4. At block 1340, the pre-traits PT_1 ... PT_t are computed by taking the minimum element of each image set:

PT_1 = min(IS_1)
...
PT_t = min(IS_t)

Other deterministic mathematical functions may also be used to compute the pre-traits. For example,
the pre-traits PT_1 ... PT_t may be computed by taking the maximum element of each image set:

PT_1 = max(IS_1)
...
PT_t = max(IS_t)

Mathematically, any mapping carrying values into a well-ordered set will suffice, max and min on
bounded integers being two simple realizations.

5. At block 1350, the traits T_1 ... T_t are computed by selecting b bits out of each pre-trait PT_1 ... PT_t. To preserve
independence of the samples, it is better to choose non-overlapping slices of bits, 0..b-1 for the first, b..2b-1 for the
second, etc., if the pre-traits are sufficiently long:

T_1 = select_{0..b-1}(PT_1)
...
T_t = select_{(t-1)b..tb-1}(PT_t)

Any deterministic function may be used to create traits that are smaller in size than the
pre-traits. For instance, a hash function could be applied to each of the pre-traits so long as the size of the result is
smaller than the pre-trait; if the total number of bits needed (tb) exceeds the size of a pre-trait, some hash functions
should be used to expand the number of bits before selecting subsets.
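
The five steps above can be sketched as follows. This is a minimal illustration under stated assumptions: the hash family H_i is modelled by salted 128-bit BLAKE2b digests (the text does not name a hash family), the function names are hypothetical, and the input list of chunk signatures is assumed to be non-empty.

```python
import hashlib

def _hash(i: int, shingle: bytes) -> int:
    """H_i: a 128-bit hash keyed by the function index i (assumed family)."""
    h = hashlib.blake2b(shingle, digest_size=16, salt=i.to_bytes(2, "big"))
    return int.from_bytes(h.digest(), "big")

def compute_traits(chunk_sigs, q: int = 1, b: int = 4, t: int = 24):
    """Compute t traits of b bits each from a non-empty list of chunk signatures."""
    # The t non-overlapping b-bit slices must fit inside one 128-bit pre-trait.
    assert t * b <= 128
    n = len(chunk_sigs)
    # Steps 1-2: overlapping shingles of q concatenated signatures;
    # the last q-1 shingles are shorter because the list runs out.
    shingles = [b"".join(chunk_sigs[i:i + q]) for i in range(n)]
    traits = []
    for i in range(t):
        # Steps 3-4: pre-trait PT_i is the minimum of image set IS_i.
        pre_trait = min(_hash(i, s) for s in shingles)
        # Step 5: trait T_i takes bits i*b .. (i+1)*b-1 of PT_i.
        traits.append((pre_trait >> (i * b)) & ((1 << b) - 1))
    return traits
```

With the defaults (b=4, t=24) the traits of one object occupy 96 bits, matching one of the parameter combinations given in the text.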


[0112] The number of traits t and the trait size b are chosen so that only a small total number of bits (t * b) is needed
to represent the traits for an object. This is advantageous if the traits are precomputed and cached by Device A, as will
be described below. According to one embodiment, some typical combinations of (b,t) parameters that have been found
to work well are, e.g., (4,24) and (6,16), for a total of 96 bits per object. Any other combinations may also be used. For
purposes of explanation, the i-th trait of object A will be denoted by T_i(A).

Efficiently Selecting the Pre-traits 

[0113] To efficiently select the pre-traits PT_1 ... PT_t, the following approach is used, allowing partial evaluation of the
shingles, and thus reducing the computational requirements for selecting the pre-traits. Logically, each H_i is divided into
two parts, High_i and Low_i. Since only the minimum element of each image set is selected, High_i is computed for
every chunk signature and Low_i is computed only for those chunk signatures which achieve the minimum value ever
achieved for High_i. If the High values are drawn from a smaller space, this may save computation. If, further, several
High values are bundled together, significant computation may be saved. Suppose, for instance, that each High value
is 8 bits long. Eight of these can be packed into a long integer; at the cost of computing a single 8-byte hash from a
signature, that value can be chopped into eight independent one-byte slices. If only the High value were needed, this
would reduce computational costs by a factor of eight. However, on average one time in 256 a corresponding Low value
needs to be computed and compared to other Low values corresponding to equal High values.
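
A minimal sketch of this partial evaluation, for eight hash functions that share one 8-byte High hash, might look as follows. The use of BLAKE2b, the personalization strings, and the names high_values, low_value and pre_traits are assumptions for illustration. The sketch works online: Low_i is evaluated whenever a signature ties or beats the smallest High_i seen so far, which is a slight variation on computing Low only for the final minimum.

```python
import hashlib

def high_values(sig: bytes) -> bytes:
    """One 8-byte hash per signature, read as eight 1-byte High values."""
    return hashlib.blake2b(sig, digest_size=8, person=b"high").digest()

def low_value(i: int, sig: bytes) -> int:
    """The Low part of H_i, computed only on demand."""
    d = hashlib.blake2b(sig, digest_size=8, salt=bytes([i]), person=b"low").digest()
    return int.from_bytes(d, "big")

def pre_traits(chunk_sigs):
    """Return eight pre-traits as (High, Low) pairs and the number of Low evaluations."""
    best = [None] * 8          # per hash function: current minimum (High, Low)
    low_calls = 0
    for sig in chunk_sigs:
        highs = high_values(sig)
        for i, hi in enumerate(highs):
            if best[i] is not None and hi > best[i][0]:
                continue       # cannot be the minimum, so Low_i is never needed
            lo = low_value(i, sig)
            low_calls += 1
            if best[i] is None or (hi, lo) < best[i]:
                best[i] = (hi, lo)
    return best, low_calls
```

The result equals the minimum of the full (High, Low) pairs, but most signatures are rejected after the single shared High hash.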

Finding Similar Objects Using the Sets of Traits 

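[0114] The algorithm approximates the set of objects similar to a given object O_B by computing the set of objects
having similar traits to O_B:

TraitSimilarity(O_B, O_A) = |{ i | T_i(A) = T_i(B) }|

SimilarTraits(O_B, Objects_A, x) = { O_A | O_A ∈ Objects_A ∧ TraitSimilarity(O_B, O_A) ≥ x }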



[0115] Other computations from which these values might be derived would work just as well.

[0116] To select the n most similar objects to a given object O_B, SimilarTraits(O_B, Objects_A, x) is computed and the
n best matching objects out of that set are taken. If the size of SimilarTraits(O_B, Objects_A, x) is smaller than n, the entire
set is taken. The resulting set of objects forms a potential set of objects O_A1, O_A2, ..., O_An identified in step 1.6 of the
modified RDC algorithm illustrated in FIGURE 12. According to the embodiments, objects may be chosen guided by
similarity, but trying also to increase diversity in the set of objects by choosing objects similar to the target, but dissimilar
from one another, or by making other choices from the set of objects with similar traits.
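
The selection described above can be sketched directly over in-memory trait lists. The function names and the dictionary layout (object reference mapped to its list of t traits) are assumptions for illustration; ties are broken by name only to make the result deterministic.

```python
def trait_similarity(traits_a, traits_b) -> int:
    """Number of positions i at which the two objects have equal traits."""
    return sum(x == y for x, y in zip(traits_a, traits_b))

def similar_traits(target, objects, x: int):
    """Objects matching the target in at least x traits, with their match counts."""
    scores = {name: trait_similarity(target, tr) for name, tr in objects.items()}
    return {name: s for name, s in scores.items() if s >= x}

def most_similar(target, objects, x: int, n: int):
    """The n best matches; the entire candidate set if it has fewer than n members."""
    cands = similar_traits(target, objects, x)
    return sorted(cands, key=lambda name: (-cands[name], name))[:n]
```

This brute-force scan touches every stored object; the compact representation described later avoids that by indexing objects by trait value.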

[0117] According to one embodiment, the following combinations of parameters (q,b,t,x) may be used:
(q=1,b=4,t=24,x=9) and (q=1,b=6,t=16,x=5).

[0118] FIGURES 15 and 16 may be used when selecting the parameters for b and t, in accordance with aspects of
the present invention. The curves for the probability of detecting matches and for false positives are shown first for
(b=4, t=24) in FIGURE 15, and then for (b=6, t=16) in FIGURE 16. Both sets of similarity curves (1510 and 1610)
allow the probabilistic detection of similar objects with true similarity in the range of 0-100%. According to one embodiment,
the false positive rate illustrated in displays 1520 and 1620 drops to an acceptable level at roughly 10 of 24 (providing
40 bits of true match), and at 6 of 16 (36 bits of match); the difference in the required number of bits is primarily due to
the reduced number of combinations drawing from a smaller set. The advantage of the larger set is increased recall:
fewer useful matches will escape attention; the cost is the increased rate of falsely detected matches. To improve both
precision and recall, the total number of bits may be increased. Switching to (b=5, t=24), for instance, would dramatically
improve precision, at the cost of increasing memory consumption for object traits.
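
As a rough aid for exploring the trade-off between b, t and x, the chance that two unrelated objects agree on at least x of t traits can be estimated with a binomial tail. This sketch rests on an idealized assumption that is not stated in the text: each b-bit trait of unrelated objects matches independently with probability 2^-b. It is not claimed to reproduce the curves of FIGURES 15 and 16, and the function name is hypothetical.

```python
from math import comb

def chance_match_probability(b: int, t: int, x: int) -> float:
    """P(at least x of t independent b-bit traits match by chance)."""
    p = 2.0 ** -b
    return sum(comb(t, k) * p**k * (1 - p) ** (t - k) for k in range(x, t + 1))

# Compare the two parameter combinations mentioned in the text.
for b, t, x in [(4, 24, 9), (6, 16, 5)]:
    print(f"b={b} t={t} x={x}: {chance_match_probability(b, t, x):.3e}")
```

Raising x or b lowers the chance-match probability under this model, at the cost of missing more genuinely similar objects (larger x) or using more memory per object (larger b).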

A Compact Representation for the Sets of Traits

[0119] It is advantageous for both Device A and Device B to cache the sets of traits for all of their stored objects so
that they don't have to recompute their traits every time they execute steps 1.6 and 1.5, respectively, of the modified
RDC algorithm (see FIGURE 12 and related discussion). To speed up the RDC computation, the trait information may
be stored in Device A's and Device B's memory, respectively.

[0120] The representation described below uses on the order of t+p memory bytes per object, where t is the number 
of traits and p is the number of bytes required to store a reference or a pointer to the object. Examples of references 
are file paths, file identifiers, or object identifiers. For typical values of t and p, this approach can support one million 
objects using less than 50MB of main memory. If a device stores more objects, it may use a heuristic to prune the number 
of objects that are involved in the similarity computation. For instance, very small objects may be eliminated a priori 
because they cannot contribute too many chunks in steps 4 and 8 of the RDC algorithm illustrated in FIGURE 12. 

[0121] FIGURE 17 illustrates data structures that make up a compact representation: an ObjectMap and a set of t
Trait Tables, in accordance with aspects of the invention.

[0122] Initially, short identifiers, or object IDs, are assigned to all of the objects. According to one embodiment, these
identifiers are consecutive non-negative 4-byte integers, thus allowing the representation of up to 4 billion objects.

[0123] A data structure (ObjectMap) maintains the mapping from object IDs to object references. It does not matter
in which order objects stored on a device get assigned object IDs. Initially, this assignment can be done by simply
scanning through the device's list of stored objects. If an object gets deleted, its corresponding entry in ObjectMap is
marked as a dead entry (by using a reserved value for the object reference). If an object is modified, its corresponding
entry in ObjectMap is marked as a dead entry, and the object gets assigned the next higher unused object ID.

[0124] When the ObjectMap becomes too sparse (something that can be easily determined by keeping track of the 
total size and the number of dead entries), both the ObjectMap and the Trait Tables are discarded and rebuilt from scratch. 
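
The ObjectMap bookkeeping described above can be sketched as a small class. The reserved value (None), the sparsity threshold (more than half of the entries dead) and the method names are assumptions chosen for illustration; the text only states that sparsity is determined from the total size and the number of dead entries.

```python
DEAD = None  # assumed reserved value marking a dead entry

class ObjectMap:
    """Maps consecutive object IDs to object references (e.g. file paths)."""

    def __init__(self, max_dead_fraction: float = 0.5):
        self.refs = []                     # index = object ID
        self.dead = 0                      # number of dead entries
        self.max_dead_fraction = max_dead_fraction

    def add(self, ref) -> int:
        """Assign the next higher unused object ID to ref."""
        self.refs.append(ref)
        return len(self.refs) - 1

    def is_dead(self, oid: int) -> bool:
        return self.refs[oid] is DEAD

    def delete(self, oid: int) -> None:
        """Mark the entry of a deleted object as dead."""
        if not self.is_dead(oid):
            self.refs[oid] = DEAD
            self.dead += 1

    def modify(self, oid: int) -> int:
        """A modified object loses its old ID and receives a fresh one."""
        if self.is_dead(oid):
            raise KeyError(oid)
        ref = self.refs[oid]
        self.delete(oid)
        return self.add(ref)

    def too_sparse(self) -> bool:
        """True when the map (and the Trait Tables) should be rebuilt."""
        return bool(self.refs) and self.dead / len(self.refs) > self.max_dead_fraction
```

Because IDs are never reused, the Trait Tables can keep stale IDs until the next rebuild; lookups simply skip entries that the map reports as dead.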
[0125] The Trait Tables form a two-level index that maps from a trait number (1 to t) and a trait value (0 to 2^b - 1) to a
TraitSet, the set of object IDs for the objects having that particular trait. A TraitSet is represented as an array with some
unused entries at the end for storing new objects. An index IX_i,k keeps track of the first unused entry in each TraitSet
array to allow for appends.

[0126] Within a TraitSet, a particular set of objects is stored in ascending order of object IDs. Because the space of
object IDs is kept dense, consecutive entries in the TraitSets can be expected to be "close" to each other in the object
ID space - on average, two consecutive entries should differ by about t*2^b (but by at least 1). If the values of t and b are
chosen so that t*2^b << 255, then consecutive entries can be encoded using on average only one unsigned byte representing
the difference between the two object IDs, as shown in FIGURE 17. An escape mechanism is provided by using the 0x00
byte to indicate that a full 4-byte object ID follows next, for the rare cases where the two consecutive object IDs differ
by more than 255.
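
The one-byte delta encoding with a 0x00 escape can be sketched as follows. Two details are assumptions made for this illustration and are not stated in the text: the first object ID of a TraitSet is always written in the escaped 4-byte form, and full IDs are stored big-endian. The function names are hypothetical.

```python
def encode_trait_set(ids) -> bytes:
    """Encode strictly ascending object IDs as one-byte deltas with a 0x00 escape."""
    out = bytearray()
    prev = None
    for oid in ids:
        delta = None if prev is None else oid - prev
        if delta is not None and 1 <= delta <= 255:
            out.append(delta)                 # common case: a single byte
        else:
            out.append(0x00)                  # escape: full 4-byte ID follows
            out += oid.to_bytes(4, "big")
        prev = oid
    return bytes(out)

def decode_trait_set(buf: bytes):
    """Inverse of encode_trait_set."""
    ids, i, prev = [], 0, 0
    while i < len(buf):
        if buf[i] == 0x00:
            prev = int.from_bytes(buf[i + 1:i + 5], "big")
            i += 5
        else:
            prev += buf[i]
            i += 1
        ids.append(prev)
    return ids
```

Since consecutive IDs differ by at least 1, a delta of zero never occurs naturally, which is what frees the 0x00 byte to serve as the escape marker.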

[0127] According to a different embodiment, if an object ID difference is smaller than 256 then it can be represented
as a single byte, otherwise the value zero is reserved to indicate that subsequent bytes represent the delta minus 256,
say, by using a 7 in 8 representation. Then, for b=6, 98% of deltas will fit in one byte, 99.7% fit in two bytes, and all but
twice in a billion into three bytes. It has been found that this scheme uses on average 1.02 bytes per object, compared
to 1.08 bytes per object for the scheme shown in FIGURE 17.

[0128] Entries in the Trait Tables corresponding to dead object IDs can be left in the Trait Tables. New entries are
appended at the end (using indices IX_1,0 ... IX_t,2^b-1).

Finding Similar Objects using the Compact Representation 

[0129] FIGURE 18 illustrates a process for finding objects with similar traits, in accordance with aspects of the invention.
According to one embodiment, to compute SimilarTraits(O_B, Objects_A, x), the steps are similar to a merge sort
algorithm. The algorithm uses (t-x+1) object buckets, OB_x ... OB_t, that are used to store objects belonging to Objects_A
that match at least x and up to and including t traits of O_B, respectively.

1. At block 1810, select the t TraitSets corresponding to the t traits of O_B: TS_1 ... TS_t. Initialize OB_x ... OB_t to empty.
Initialize indices P_1 ... P_t to point to the first element of TS_1 ... TS_t, respectively. TS_k[P_k] is the notation for the object
ID pointed to by P_k.

2. At decision block 1820, if all of P_1 ... P_t point past the last element of their TraitSet arrays TS_1 ... TS_t, respectively,
then go to step 6 (block 1860).

3. At block 1830, the MinP set is selected, which is the set of indices pointing to the minimum object ID, as follows:

MinP = { P_k | for all j in [1, t]: TS_j[P_j] ≥ TS_k[P_k] }

Let MinID be the minimum object ID pointed to by all the indices in MinP.

4. At block 1840, let k = |MinP|, which corresponds to the number of matching traits. If k ≥ x and if ObjectMap(MinP)
is not a dead entry, then append MinID to OB_k.

5. Advance every index P_k in MinP to the next object ID in its respective TraitSet array TS_k. Go to step 2 (block 1820).

6. At block 1860, select the similar objects by first selecting objects from OB_t, then from OB_t-1, etc., until the desired
number of similar objects has been selected or no more objects are left in OB_x. The object IDs produced by the
above steps can be easily mapped into object references by using the ObjectMap.
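
The six steps can be sketched over plain sorted lists standing in for the decoded TraitSets. The function name, the is_dead callback (a stand-in for the ObjectMap lookup) and the zero-based indexing are assumptions for illustration; indices that have run past the end of their array are simply ignored when the minimum is taken.

```python
def find_similar(trait_sets, x: int, is_dead, wanted: int):
    """Merge t ascending lists of object IDs; return up to `wanted` best-matching IDs."""
    t = len(trait_sets)
    buckets = {k: [] for k in range(x, t + 1)}       # OB_x ... OB_t
    pos = [0] * t                                    # P_1 ... P_t
    while True:
        # Step 2: stop when every index points past the end of its TraitSet.
        live = [k for k in range(t) if pos[k] < len(trait_sets[k])]
        if not live:
            break
        # Step 3: MinP is the set of indices pointing at the minimum object ID.
        min_id = min(trait_sets[k][pos[k]] for k in live)
        min_p = [k for k in live if trait_sets[k][pos[k]] == min_id]
        # Step 4: |MinP| is the number of traits this object shares with O_B.
        matches = len(min_p)
        if matches >= x and not is_dead(min_id):
            buckets[matches].append(min_id)
        # Step 5: advance every index in MinP.
        for k in min_p:
            pos[k] += 1
    # Step 6: take objects from OB_t first, then OB_t-1, and so on down to OB_x.
    result = []
    for k in range(t, x - 1, -1):
        for oid in buckets[k]:
            if len(result) == wanted:
                return result
            result.append(oid)
    return result
```

Each object ID is visited once per TraitSet in which it occurs, so the work is proportional to the total length of the t selected TraitSets rather than to the number of stored objects.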

[0130] The above specification, examples and data provide a complete description of the manufacture and use of the 

composition of the invention. Since many embodiments of the invention can be made without departing from the spirit 
and scope of the invention, the invention resides in the claims hereinafter appended. 




Claims 



1. A method for identifying objects for remote differential compression, comprising:

calculating traits for an object;
using the traits to identify candidate objects that are at least somewhat similar to the object; and
selecting final objects from the identified candidate objects.

2. The method of Claim 1, wherein calculating the traits for the object further comprises:

partitioning the object into chunks;
computing signatures for each of the object chunks;
grouping the signatures into shingles;
computing at least one shingle signature for each of the shingles;
mapping the shingle signatures into image sets;
calculating pre-traits from the image sets; and
computing the traits using the pre-traits, wherein the traits are smaller in size as compared to the pre-traits.

3. The method of Claim 2, further comprising using the set of candidate objects when performing remote differential
compression techniques.

4. The method of Claim 2, further comprising: generating fingerprints at each byte position of the object by using the
values of the bytes in a small window around each position.

5. The method of Claim 4, wherein partitioning the object into chunks comprises chunking the object based on the 
fingerprints. 

6. The method of Claim 2, wherein grouping the signatures into shingles consists of concatenating q consecutive
signatures to form a shingle.

7. The method of Claim 2, wherein mapping the shingles into image sets comprises: applying t hash functions to each
shingle in turn, thereby creating t image sets.

8. The method of Claim 2, wherein calculating the pre-traits from the image sets comprises applying a deterministic
mathematical function that selects one of the computed hash values from each image set.

9. The method of Claim 8, wherein the deterministic mathematical function is selected from a maxima function and a
minima function.

10. The method of Claim 7, wherein calculating the traits using the pre-traits comprises applying a deterministic function
to each of the pre-traits that creates traits each having a predetermined number of bits that is smaller than the traits.

11. The method of Claim 9, wherein the deterministic function comprises selecting a specified set of b bits out of each
pre-trait.

12. The method of Claim 1, wherein using the traits to identify candidate objects that are at least somewhat similar to 
the object comprises calculating the number of traits that match between the two objects. 

13. The method of Claim 12, further comprising: selecting the k best-matching objects guided by the values of the trait
similarity.

14. The method of Claim 3, wherein the remote differential compression techniques comprise:

partitioning the k best matching local objects into local chunks;
computing a local signature for each local chunk;
comparing remote signatures received from a remote device to the local signatures; and
reusing local chunks to reconstruct the remote object when the signature comparison indicates the local chunk
may be reused.


15. The method of Claim 2, further comprising storing the traits within volatile memory of a device by compactly encoding
the traits.

16. The method of Claim 15, further comprising storing within volatile memory of a device only a selected subset of the
traits.

17. The method of Claim 2, further comprising storing the traits within persistent memory of a device.

18. The method of Claim 17, further comprising storing within persistent memory of a device only a selected subset of
the traits.

19. The method of Claim 15, further comprising:

creating an object map that compactly represents the object IDs on a device, wherein each compact representation
is represented using a predetermined size; and
creating trait tables that form at least a two-level index that maps from a trait number and a trait value to a trait set.

20. The method of Claim 19, further comprising rebuilding the object map when it is determined that the object map is sparse.

21. The method of Claim 19, further comprising: finding objects with similar traits, including steps for:

creating buckets (OB_x ... OB_t) to store local objects that match at least x traits of a remote object O_B;
selecting t TraitSets (TS_1 ... TS_t) corresponding to t traits of object O_B;
initializing indices (P_1 ... P_t) to point to the first element of TS_1 ... TS_t, respectively, wherein TS_k[P_k] is the notation
for the object ID pointed to by P_k;
selecting a desired number of similar objects when it is determined that each of P_1 ... P_t point past the last
element of their TraitSet arrays TS_1 ... TS_t, respectively;
selecting a MinP set, wherein the MinP set is the set of indices pointing to the minimum object ID;
setting MinID to be the minimum object ID pointed to by all the indices in MinP;
appending MinID to OB_k when it is determined that k ≥ x and ObjectMap(MinP) is not a dead entry, wherein k
= |MinP|, which corresponds to the number of matching traits; and
advancing each index P_k in MinP to the next object ID in its respective TraitSet array TS_k.

22. The method of Claim 21, wherein selecting a desired number of similar objects comprises first selecting objects
from OB_t, and decrementing t and selecting from OB_t until the desired number of similar objects has been selected.

23. The method of Claim 21, wherein selecting a desired number of similar objects comprises first selecting objects
from OB_t, and decrementing t and selecting from OB_t.

24. A computer-readable medium having computer executable instructions for identifying objects for remote differential
compression, comprising:

partitioning an object into chunks;
computing signatures for each of the object chunks;
grouping the signatures into shingles;
computing at least one shingle signature for each of the shingles;
mapping the shingle signatures into image sets;
calculating pre-traits from the image sets;
computing the traits using the pre-traits, wherein the traits are smaller in size as compared to the pre-traits;
using the traits to identify candidate objects that are at least somewhat similar to the object; and
selecting final objects from the identified candidate objects.

25. The computer-readable medium of Claim 24, wherein partitioning the object into chunks comprises: generating
fingerprints at each byte position of the object by using the values of the bytes in a small window around each
position and chunking the object based on the fingerprints.

26. The computer-readable medium of Claim 24, wherein grouping the signatures into shingles consists of concatenating
signatures to form a shingle.

27. The computer-readable medium of Claim 24, wherein mapping the shingles into image sets comprises: applying
at least one hash function to each shingle.

28. The computer-readable medium of Claim 24, wherein calculating the pre-traits from the image sets comprises
applying a deterministic mathematical function that selects one of the computed hash values from each image set,
wherein the deterministic mathematical function is selected from a maxima function and a minima function.

29. The computer-readable medium of Claim 28, wherein calculating the traits using the pre-traits comprises applying
a deterministic function to each of the pre-traits that creates traits each having a predetermined number of bits that
is smaller than the traits.

30. The computer-readable medium of Claim 24, wherein using the traits to identify candidate objects that are at least
somewhat similar to the object comprises calculating the number of traits that match between the two objects.

31. The computer-readable medium of Claim 24, further comprising:

creating an object map that compactly represents the object IDs on a device, wherein each compact representation
is represented using a predetermined size; and
creating trait tables that form at least a two-level index that maps from a trait number and a trait value to a trait set.

32. The computer-readable medium of Claim 31, further comprising rebuilding the object map when it is determined that
the object map is sparse.


33. The computer-readable medium of Claim 31, further comprising: finding objects with similar traits, including steps for:

creating buckets (OB_x ... OB_t) to store local objects that match at least x traits of a remote object O_B;
selecting t TraitSets (TS_1 ... TS_t) corresponding to t traits of object O_B;
initializing indices (P_1 ... P_t) to point to the first element of TS_1 ... TS_t, respectively, wherein TS_k[P_k] is the notation
for the object ID pointed to by P_k;
selecting a desired number of similar objects when it is determined that each of P_1 ... P_t point past the last
element of their TraitSet arrays TS_1 ... TS_t, respectively;
selecting a MinP set, wherein the MinP set is the set of indices pointing to the minimum object ID;
setting MinID to be the minimum object ID pointed to by all the indices in MinP;
appending MinID to OB_k when it is determined that k ≥ x and ObjectMap(MinP) is not a dead entry, wherein k
= |MinP|, which corresponds to the number of matching traits; and
advancing each index P_k in MinP to the next object ID in its respective TraitSet array TS_k.

34. A system for identifying objects for remote differential compression, comprising:

a remote device configured to perform steps, comprising:

receive a request for an Object O_B;
send a set of traits of Object O_B to a local device;
partition Object O_B into chunks and compute signatures for each of the chunks;
send the list of chunk signatures to the local device; and
provide requested chunks when requested; and

the local device configured to perform steps, comprising:

request Object O_B from the remote device;
receive the set of traits of Object O_B from the remote device;
use the set of traits of Object O_B to identify similar objects that it already stores on the local device;
partition the similar objects into chunks;
compute signatures for each of the similar object chunks;
receive the list of chunk signatures from the remote device;
compare the received signatures against the locally computed signatures;
request chunks from the remote device that did not match in the comparison;
receive the requested chunks; and
reconstruct Object O_B using the received chunks and chunks reused from the similar objects.

35. The system of Claim 34, wherein the local device is further configured to:

group the signatures into shingles;
compute at least one shingle signature for each of the shingles;
map the shingle signatures into image sets;
calculate pre-traits from the image sets;
compute the traits using the pre-traits, wherein the traits are smaller in size as compared to the pre-traits;
use the traits to identify candidate objects that are at least somewhat similar to the object; and
select final objects from the identified candidate objects.

36. The system of Claim 35, wherein partitioning the object into chunks comprises: generating fingerprints at each byte
position of the object by using the values of the bytes in a small window around each position and chunking the
object based on the fingerprints.

37. The system of Claim 36, wherein mapping the shingles into image sets on the local device comprises: applying at
least one hash function to each shingle.

38. The system of Claim 37, wherein calculating the pre-traits from the image sets comprises applying a deterministic
mathematical function that selects one of the computed hash values from each image set, wherein the deterministic
mathematical function is selected from a maxima function and a minima function.

39. The system of Claim 38, wherein calculating the traits using the pre-traits comprises applying a deterministic function
to each of the pre-traits that creates traits each having a predetermined number of bits that is smaller than the traits.

40. The system of Claim 36, wherein using the traits to identify candidate objects that are at least somewhat similar to
the object comprises calculating the number of traits that match between the two objects.

41. The system of Claim 36, wherein the local device is further configured to:

create an object map that compactly represents the object IDs on the local device, wherein each compact
representation is represented using a predetermined size; and
create trait tables that form at least a two-level index that maps from a trait number and a trait value to a trait set.

42. The system of Claim 41, further comprising rebuilding the object map when it is determined that the object map is sparse.

43. The system of Claim 41, further comprising: finding objects with similar traits, including steps for:

creating buckets (OB_x ... OB_t) to store local objects that match at least x traits of a remote object O_B;
selecting t TraitSets (TS_1 ... TS_t) corresponding to t traits of object O_B;
initializing indices (P_1 ... P_t) to point to the first element of TS_1 ... TS_t, respectively, wherein TS_k[P_k] is the notation
for the object ID pointed to by P_k;
selecting a desired number of similar objects when it is determined that each of P_1 ... P_t point past the last
element of their TraitSet arrays TS_1 ... TS_t, respectively;
selecting a MinP set, wherein the MinP set is the set of indices pointing to the minimum object ID;
setting MinID to be the minimum object ID pointed to by all the indices in MinP;
appending MinID to OB_k when it is determined that k ≥ x and ObjectMap(MinP) is not a dead entry, wherein k
= |MinP|, which corresponds to the number of matching traits; and
advancing each index P_k in MinP to the next object ID in its respective TraitSet array TS_k.

44. The system of Claim 43, wherein selecting a desired number of similar objects comprises first selecting objects from
OB_t, and decrementing t and selecting from OB_t until the desired number of similar objects has been selected.

45. The system of Claim 43, wherein selecting a desired number of similar objects comprises first selecting objects from
OB_t, and decrementing t and selecting from OB_t.

FIG. 5A (flowchart; drawing not reproduced). Legible box labels:

Device A (reference numerals 501-506 and 513 are legible): ... CHUNKS, RSigA1 ... RSigAs; RECEIVE RECURSIVE SIGNATURE
AND CHUNK LENGTH LIST (RSigB1, RLenB1) ... (RSigBr, RLenBr); SET REQUEST = Empty, SET ROFFSET = 0;
SELECT NEXT (RSigBi, RLenBi); decision block (YES/NO); ADD (ROFFSET, RLenBi) TO REQUEST;
SET ROFFSET = ROFFSET + RLenBi; RECEIVE AND DECOMPRESS RECURSIVE CHUNKS; ASSEMBLE SIG; (TO STEP 405).

Device B (reference numerals 551-555): (FROM STEP 453); COMPUTE RECURSIVE FINGERPRINTS; DIVIDE SIGNATURE AND CHUNK
LIST INTO RECURSIVE CHUNKS; COMPUTE RECURSIVE SIGNATURES FOR RECURSIVE CHUNKS, RSigB1 ... RSigBr;
SEND RECURSIVE SIGNATURE AND CHUNK LENGTH LIST (RSigB1, RLenB1) ... (RSigBr, RLenBr); RECEIVE REQUEST;
SEND COMPRESSED RECURSIVE CHUNKS.

FIG. 5B (flowchart; drawing not reproduced). Legible box labels (reference numerals 513-520): ASSEMBLE SIG (513);
SET SIGCL = Empty; SELECT NEXT (RSigBi, RLenBi); decision block (YES: APPEND LOCAL RECURSIVE CHUNK TO SIGCL;
NO: APPEND RECEIVED RECURSIVE CHUNK TO SIGCL); PROCESSED ALL (RSigBi, RLenBi)? (YES/NO);
SET SIGNATURE AND CHUNK LENGTH LIST TO SIGCL; DONE.

FIG. 6 (drawing not reproduced). Legible labels: Original object: 9.1 GB; 1st level SHAs: 42 MB; 2nd level SHAs: 395 KB;
3M chunks; 33K chunks.

(Flowchart fragment; drawing not reproduced. Legible labels: START; CUTPOINT(HASH, OFFSET); RESULT = FALSE;
reference numerals 800, 801.)

structure Entry
    var isMax as Boolean = false
    var hash as Integer = 0
    var offset as Integer = 0

class LocalMaxCut
    h as Integer
    var min as Integer = 0
    var max as Integer = 0
    var M as Array of Entry = new Entry[h]

    CutPoint(hash as Integer, offset as Integer) as Boolean
        var result = false
        step
            if M[max].offset + h + 1 = offset then
                result := M[max].isMax
                max := (max+1) mod h
        step
            while true do step
                step
                    if M[min].hash > hash then
                        step
                            min := (min-1) mod h
                        step
                            M[min] := Entry(false, hash, offset)
                            return result
                    if M[min].hash = hash then
                        M[min] := Entry(false, hash, offset)
                        return result
                    if M[min].hash < hash and min = max then
                        M[min] := Entry(true, hash, offset)
                        return result
                step

FIG. 9

structure Entry
    var offset as Integer = 0
    var isMax as Boolean = false
    var hash as Integer = 0

class LocalMaxCut
    horizon as Integer
    var hashes as Seq of Integer
    var k as Integer = 0
    var l as Integer = 0
    var A as Array of Entry = new Entry[horizon]
    var B as Array of Entry = new Entry[horizon]

    CutPoints() as Seq of Integer
        var cuts as Seq of Integer = []
        for window = 0 to Length(hashes)/horizon do step
            let first = window*horizon
            let last = min((window+1)*horizon, Length(hashes)) - 1
            cuts := cuts + CutPoint(first, last)
        return cuts

    CutPoint(first as Integer, last as Integer) as Seq of Integer
        step // Initialize A with the first entry at the offset
            k := 0
            A[0] := Entry(last, true, hashes[last])
            last := last - 1
        step // Update A[k] in the interval up to B[l]'s horizon
            while last > B[l].offset + horizon do step
                Insert(last)
                last := last - 1
        step // Update A[k] and B[l] in the remaining interval
            while last >= first do step
                Insert(last)
                if B[l].hash <= hashes[last] then
                    B[l].isMax := false
                last := last - 1
        step // Determine whether A[k] is a cut-point with respect to B
            A[k].isMax := A[k].isMax and
                forall j in {0..l} holds
                    (B[j].offset + horizon < A[k].offset or
                     B[j].hash < A[k].hash)
        step // Set B to A for the next round and return cut-point
            B, l := A, k
            return if B[l].isMax then [B[l].offset] else []

FIG. 10






class LocalMaxCut
  Insert(offset as Integer)
    if hashes[offset] >= A[k].hash then
      if hashes[offset] = A[k].hash then
        // duplicated hashes within distance
        // of "horizon" are not maximal.
        A[k].isMax := false
      else
        A[k+1] := Entry(offset, true, hashes[offset])
        k := k + 1

FIG. 11
Device A to Device B: Request Object OB
Device B to Device A: Send a set of traits of Object OB
Device A (1.6): Use the traits of Object OB to identify OA1, OA2, ..., OAn, a subset of the objects that it already stores that are similar to Object OB
Device A (2): Partition identified Objects OA1, OA2, ..., OAn, and compute a signature SigAik for each chunk k of each Object OAi
Device B (3): Using a similar approach as step 2, partition Object OB into chunks and compute signatures SigBj for each of the chunks
Device B to Device A: Send list of chunk signatures
Device A: Compare received signatures against signatures computed in Step 2
Device A to Device B: Request chunks whose signatures did not match
Device B to Device A: Send chunks
Device A: Reconstruct Object OB using received chunks and chunks reused from OA1, OA2, ..., OAn.

FIG. 12
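The exchange of FIG. 12 from the signature steps onward can be sketched in Python as a single-process simulation. This is an illustration under stated assumptions rather than the protocol as implemented: chunks are cut at a fixed size instead of at content-defined cut-points, SHA-256 stands in for the chunk signature function, and the names split_fixed, chunk_signature and rdc_transfer are hypothetical.

```python
import hashlib


def chunk_signature(chunk):
    # Assumption: SHA-256 stands in for the chunk signature function.
    return hashlib.sha256(chunk).digest()


def split_fixed(data, size):
    # Assumption: fixed-size chunks replace content-defined cut-points.
    return [data[i:i + size] for i in range(0, len(data), size)]


def rdc_transfer(object_b, similar_objects, chunk_size=4):
    """Simulate both devices; return (rebuilt object, number of chunks sent)."""
    # Device A: index the chunks of its similar objects by signature.
    known = {}
    for obj in similar_objects:
        for chunk in split_fixed(obj, chunk_size):
            known[chunk_signature(chunk)] = chunk

    # Device B: partition Object OB and build the list of chunk signatures.
    chunks_b = split_fixed(object_b, chunk_size)
    signatures_b = [chunk_signature(c) for c in chunks_b]

    # Device A: compare, then request the chunks whose signatures did not match.
    missing = [i for i, sig in enumerate(signatures_b) if sig not in known]

    # Device B: send only the requested chunks.
    received = {i: chunks_b[i] for i in missing}

    # Device A: reconstruct OB from received and reused chunks.
    rebuilt = b"".join(
        received[i] if i in received else known[sig]
        for i, sig in enumerate(signatures_b)
    )
    return rebuilt, len(received)


if __name__ == "__main__":
    rebuilt, sent = rdc_transfer(b"AAAABBBBCCCCDDDD", [b"AAAAXXXX", b"CCCCDDDD"])
    print(rebuilt, sent)
```

In this toy run only the chunk absent from every similar object crosses the wire; the other three are reused from what Device A already stores.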
Start
Create Shingles for Object O (1310)
Create Signature for Each Shingle (1320)
Map Shingle Set into Image Sets (1330)
Calculate pre-traits PT1 ... PTt (1340)
Calculate traits T1 ... Tt (1350)
DONE

FIG. 13
[Diagram, q=4 in this example: the signatures Sig1, Sig2, Sig3, Sig4, ... of Object O are grouped into the Shingles of O (Shingle1, Shingle2, Shingle3, ...). Each hash function H1 ... Ht is applied to every shingle (H1(Shingle1), H1(Shingle2), ...; Ht(Shingle1), Ht(Shingle2), ...). For each hash function a minimum (min) is taken, followed by a selection step (sel), giving Traits(O) = T1, ..., Tt.]

FIG. 14
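The trait computation outlined in FIGS. 13 and 14 can be sketched as follows. This is a hedged Python illustration, not the hash functions or bit selection of the figures: H1 ... Ht are simulated by prefixing the shingle with the function index before a SHA-256 hash, the pre-trait is taken as the minimum image value, and the trait keeps the low b bits of the pre-trait (which bits are selected is an assumption). The names shingles, image_hash, traits and matching_traits are hypothetical.

```python
import hashlib


def shingles(signatures, q):
    # A shingle is q consecutive chunk signatures.
    return [tuple(signatures[i:i + q]) for i in range(len(signatures) - q + 1)]


def image_hash(i, shingle):
    # Assumption: H_i is simulated by salting SHA-256 with the index i.
    data = i.to_bytes(2, "big") + b"".join(shingle)
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")


def traits(signatures, q=4, t=16, b=4):
    """Return t traits of b bits each for a list of chunk signatures (bytes)."""
    shingle_set = shingles(signatures, q)
    if not shingle_set:
        raise ValueError("need at least q signatures")
    result = []
    for i in range(t):
        pre_trait = min(image_hash(i, s) for s in shingle_set)  # PT_i
        result.append(pre_trait & ((1 << b) - 1))  # assumed: keep low b bits
    return result


def matching_traits(a, b):
    # Similarity estimate: number of positions where the traits agree.
    return sum(1 for x, y in zip(a, b) if x == y)


if __name__ == "__main__":
    sigs = [bytes([i]) * 4 for i in range(10)]
    print(traits(sigs))
```

Two objects that share most shingles will tend to share the minimum under most of the t hash functions, so the count returned by matching_traits grows with their true similarity.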
[Chart: False positive detection, series A (1520). Horizontal axis: True similarity, 0% to 100%. Legend: one curve for each threshold from >=0/24 to >=24/24, L1B4.]

FIG. 15
[Chart: Similarity curves, series B (1610). Horizontal axis: True similarity, 0% to 100%.]

[Chart: False positive detection, series B (1620). Horizontal axis: True similarity, 0% to 100%. Legend: one curve for each threshold from >=0/16 to >=16/16, L1B6.]

FIG. 16
[Diagram: storage layout. ObjectMap: a sequence of p-byte entries (Ref0, Ref2, ...), one of which is marked as a dead entry. TTt: a table with entries indexed from 0, each pointing into a TraitSet. Field widths shown: 1B, 1B, 5B, 1B. Fields shown: Offs, Offs, 0x00, Absolute object ID, Offs.]

FIG. 17
Start
Select Trait Sets; Initialize indices; Initialize object buckets (1810)
Select MinIX Set
Append MinID to OBk when determined
Advance every index IXk in MinIX to the next object ID in its respective TSk
Select Similar Objects (1860)
DONE
(Reference numerals 1820, 1830, 1840 and 1850 mark the intermediate steps.)

FIG. 18
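The scan summarized in FIG. 18 can be sketched as a merge over sorted trait sets. This is a minimal Python illustration under stated assumptions: each trait set TSk is a list of object IDs sorted in ascending order without duplicates, the bucket an object lands in is indexed by the number of trait sets that contain it, and the n best-matching objects are returned. The function name select_similar and the parameter n are hypothetical.

```python
def select_similar(trait_sets, n):
    """Return up to n object IDs, ordered by how many trait sets contain them."""
    t = len(trait_sets)
    ix = [0] * t                          # one index IXk per trait set TSk
    buckets = [[] for _ in range(t + 1)]  # buckets[c]: objects matching c traits

    while True:
        live = [k for k in range(t) if ix[k] < len(trait_sets[k])]
        if not live:
            break
        # MinID is the smallest object ID currently pointed at; MinIX is the
        # set of indices that point at it.
        min_id = min(trait_sets[k][ix[k]] for k in live)
        min_ix = [k for k in live if trait_sets[k][ix[k]] == min_id]
        buckets[len(min_ix)].append(min_id)
        for k in min_ix:                  # advance every index in MinIX
            ix[k] += 1

    similar = []
    for count in range(t, 0, -1):         # best-matching buckets first
        for object_id in buckets[count]:
            if len(similar) < n:
                similar.append(object_id)
    return similar


if __name__ == "__main__":
    print(select_similar([[1, 3, 5], [3, 5, 9], [3, 7]], 2))
```

Because every trait set is sorted, each object ID is visited once and the number of indices that advance together gives its match count directly, without a separate counting pass.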